Preface



No applied mathematician can be properly trained without some basic
understanding of numerical methods, i.e., numerical analysis. And no scientist
or engineer should be using a package program for numerical computations
without understanding the program’s purpose and its limitations.
This book is an attempt to provide some of the required knowledge and 
understanding. It is written in a spirit that considers numerical analysis 
not merely as a tool for solving applied problems but also as a challenging 
and rewarding part of mathematics. The main goal is to provide insight 
into numerical analysis rather than merely to provide numerical recipes. 

The book evolved from the courses on numerical analysis I have taught 
since 1971 at the University of Göttingen and may be viewed as a successor
of an earlier version jointly written with Bruno Brosowski [10] in 1974. It 
aims at presenting the basic ideas of numerical analysis in a style as concise 
as possible. Its volume is scaled to a one-year course, i.e., a two-semester 
course, addressing second-year students at a German university or advanced 
undergraduate or first-year graduate students at an American university. 

In order to make the book accessible not only to mathematicians but 
also to scientists and engineers, I have planned it to be as self-contained as 
possible. As prerequisites it requires only a solid foundation in differential 
and integral calculus and in linear algebra as well as an enthusiasm to see 
these fundamental and powerful tools in action for solving applied prob- 
lems. A short presentation of some basic functional analysis is provided in 
the book to the extent required for a modern presentation of numerical 
analysis and a deeper understanding of the subject. 







An introductory book of a few hundred pages cannot completely cover 
all classical aspects of numerical analysis and all of the more recent devel- 
opments. I am willing to admit that the choice of some of the topics in the 
present volume is biased by my own preferences and that some important 
subjects are omitted. 

I was taught numerical analysis in the mid sixties by my thesis adviser, 
Professor Erich Martensen, at the Technische Hochschule in Darmstadt. 
Martensen’s perspective on teaching mathematics in general and numeri- 
cal analysis in particular had a great and long-lasting impact on my own 
teaching. Therefore, this book is dedicated to Erich Martensen on the oc- 
casion of his seventieth birthday. 

I would like to thank Thomas Gerlach and Peter Otte for carefully read- 
ing the book, for checking the solutions to the problems, and for a number 
of suggestions for improvements. Special thanks are given to my friend 
David Colton for reading over the book for correct use of the English lan- 
guage. Part of the book was written while I was on sabbatical leave at the 
Department of Mathematical Sciences at the University of Delaware and 
the Department of Mathematics at the University of New South Wales. I 
gratefully acknowledge the hospitality of these institutions. I also am grateful
to Springer-Verlag for being willing to take the economic risk of adding
yet another volume to the already huge number of existing introductions 
to numerical analysis. 



Göttingen, September 1997



Rainer Kress

Glossary of Symbols

Sets and Spaces

$\mathbb{N}$                  set of natural numbers
$\mathbb{Z}$                  set of integers
$\mathbb{R}$                  set of real numbers
$\mathbb{C}$                  set of complex numbers
$|x|$                         absolute value of a real or complex number $x$
$(a, b)$                      open interval $(a, b) := \{x \in \mathbb{R} : a < x < b\}$
$[a, b]$                      closed interval $[a, b] := \{x \in \mathbb{R} : a \le x \le b\}$
$\bar{x}$                     conjugate of a complex number $x$
$\mathbb{R}^n$                $n$-dimensional real Euclidean space
$\mathbb{C}^n$                $n$-dimensional complex Euclidean space
$C[a, b]$                     space of real- or complex-valued continuous functions on the interval $[a, b]$
$C^m[a, b]$                   space of $m$-times continuously differentiable functions
$L^2[a, b]$                   space of real- or complex-valued square-integrable functions
$\{a_1, \ldots, a_m\}$        set of $m$ elements $a_1, \ldots, a_m$
$U \times V$                  product $U \times V := \{(x, y) : x \in U,\ y \in V\}$ of two sets $U$ and $V$
$U \setminus V$               difference set $U \setminus V := \{x \in U : x \notin V\}$ for two sets $U$ and $V$
$\bar{U}$                     closure of a set $U$
$F : X \to Y$                 a mapping with domain $X$ and range in $Y$







Vectors and Matrices

$x = (x_1, \ldots, x_n)$                        row vector in $\mathbb{R}^n$ or $\mathbb{C}^n$ with components $x_1, \ldots, x_n$
$x^T = (x_1, \ldots, x_n)^T$                    the transpose of $x$, i.e., a column vector
$x^* = (\bar{x}_1, \ldots, \bar{x}_n)^T$        the adjoint of $x$
$A = (a_{jk})$                                  $m \times n$ matrix with elements $a_{jk}$
$A^T$                                           the transpose of $A$
$A^*$                                           the adjoint of $A$
$A^\dagger$                                     the pseudo-inverse of $A$
$A^{-1}$                                        the inverse of an $n \times n$ matrix $A$
$\det A$                                        the determinant of an $n \times n$ matrix $A$
$\operatorname{cond}(A)$                        the condition number of an $n \times n$ matrix $A$
$\rho(A)$                                       the spectral radius of an $n \times n$ matrix $A$
$I$                                             the $n \times n$ identity matrix
$\operatorname{diag}(a_1, \ldots, a_n)$         diagonal matrix with diagonal elements $a_1, \ldots, a_n$


Norms

$\|\cdot\|$                   norm on a linear space
$\|\cdot\|_1$                 $\ell_1$ norm of a vector, $L_1$ norm of a function
$\|\cdot\|_2$                 $\ell_2$ norm of a vector, $L_2$ norm of a function
$\|\cdot\|_\infty$            maximum norm of a vector or a function
$(\cdot\,, \cdot)$            scalar product on a linear space


Miscellaneous

$\in$                         element inclusion
$\subset$                     set inclusion
$\cup$, $\cap$                union and intersection of sets
$\emptyset$                   empty set
$O(m)$                        a quantity of order $m$
□                             end of proof

Introduction



Numerical analysis is concerned with the development and investigation of 
constructive methods for the numerical solution of mathematical problems. 
This objective differs from a pure-mathematical approach as illustrated by 
the following three examples. 

By the fundamental theorem of algebra, a polynomial of degree n has 
n complex zeros. The various proofs of this result, in general, are noncon- 
structive and give no procedure for the explicit computation of these zeros. 
Numerical analysis provides constructive methods for the actual computa- 
tion of the zeros of a polynomial. 

The solution of a system of n linear equations for n unknowns can be 
given explicitly by Cramer’s rule. However, Cramer’s rule is only of the- 
oretical importance, since for actual computations it is completely useless 
for linear systems with more than three unknowns. An important task 
in numerical analysis consists in describing and developing more practical 
methods for the solution of systems of linear equations. 

By the Picard-Lindelöf theorem, the initial value problem for an ordinary
differential equation has a unique solution (under appropriate regularity
assumptions). Despite the fact that the existence proof in the Picard-Lindelöf
theorem actually is constructive through the use of successive iterations, in
applied mathematics there is a need for more effective procedures to numerically
solve the initial value problem.

In general, we may say that for the basic problems in numerical analysis 
existence and uniqueness of a solution are guaranteed through the results 
of pure mathematics. The main topic of numerical analysis is to provide 
efficient numerical methods for the actual computation of the solution. In
some cases these numerical methods are actually based on constructive
existence proofs. 

By a constructive method we understand a procedure that for any pre- 
scribed accuracy determines an approximate solution by a finite number 
of computational steps. In general, the number of computational steps of 
course will depend on the required accuracy. Only very few methods will 
terminate with the exact solution after finitely many computational steps 
as, for example, Gaussian elimination for solving a system of linear equa- 
tions. In most cases, the numerical methods will only yield approximations 
to the exact solution. As a typical example, the numerical evaluation of 
a definite integral by the trapezoidal rule will, in general, provide only 
an approximate value for the integral. In this context two main questions 
arise, namely the question of estimating the error between the exact and 
the approximate solution and the question of numerical stability. 
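
To make the dependence of the number of computational steps on the required accuracy concrete, the following minimal sketch applies the composite trapezoidal rule to an integral whose exact value is known. The function name `trapezoidal`, the test integrand, and the chosen numbers of subintervals are illustrative assumptions and are not taken from the text.

```python
import math


def trapezoidal(f, a, b, n):
    """Composite trapezoidal rule with n equal subintervals on [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))       # end points carry weight 1/2
    for j in range(1, n):
        total += f(a + j * h)         # interior points carry weight 1
    return h * total


# Illustrative test problem: the integral of sin over [0, pi] equals 2.
exact = 2.0
errors = [abs(trapezoidal(math.sin, 0.0, math.pi, n) - exact)
          for n in (4, 8, 16, 32)]
```

Each value is only an approximation to the integral, and the error shrinks as more function evaluations are spent, which is exactly the trade-off described above.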

A numerical method is useful only if it is possible to decide on the accu- 
racy of the approximate solution, i.e., if reliable estimates on the difference 
between the exact and approximate solution can be given. Therefore, be- 
sides the development and design of numerical schemes, a substantial part 
of numerical analysis is concerned with the investigation and estimation of 
the errors occurring in these schemes. Here one has to discriminate between 
the approximation errors, i.e., the errors that arise through replacing the 
original problem by an approximate problem, and the roundoff errors, i.e., 
the errors that occur through the fact that in the actual computation, in 
general, real numbers are replaced by floating-point decimal numbers with 
a fixed number of digits. 
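
A roundoff error can be observed in a few lines. The sketch below is an illustration under one stated assumption: it relies on Python's built-in `float` type, which uses binary rather than decimal floating-point numbers, but the effect of carrying only a fixed number of digits is the same in kind.

```python
# Add 0.1 ten times. The number 0.1 has no finite binary representation,
# so every addition works with a rounded value.
total = 0.0
for _ in range(10):
    total += 0.1

# Deviation from the exact result 1 caused purely by rounding.
roundoff = abs(total - 1.0)
```

No approximation error is involved here, since the mathematical problem (a finite sum) is solved exactly; the small deviation stored in `roundoff` stems only from the finite number of digits.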

As far as stability is concerned, one has to distinguish between properly 
and improperly posed problems. A problem is called properly posed or 
well-posed if the solution depends continuously on the data, i.e., if small 
changes in the data cause only small changes in the solution. Otherwise, the 
problem is called improperly posed or ill-posed. Numerical approximations 
never can circumvent the improper posedness of a problem. However, it is 
desirable to control the effects of the ill-posed nature of a problem by an 
adequate choice of the numerical method. On the other hand, for properly 
posed problems efforts have to be made not to destroy the well-posedness 
by a poorly designed numerical approximation. 

To the author’s taste, the topic of stability and proper posedness is
more challenging from a mathematical perspective than the rather uninspiring
topic of roundoff errors. Therefore, in this book emphasis is given
to ill-posedness and the related issue of ill-conditioning, whereas
roundoff errors are given only cursory attention.

The basic problems of numerical analysis are as old as mathematics it- 
self, and for a number of problems there exist classical approaches such as 
Newton’s method for the solution of nonlinear equations, Gaussian elimi- 
nation for the solution of systems of linear equations, Gauss-Seidel and 
Jacobi iterations for linear systems, Lagrange interpolation for the
approximation of arbitrary functions by polynomials, Simpson’s rule for
numerical integration, and Euler’s method for the solution of initial value
problems. However, the main breakthrough of numerical methods is connected
with the advances in computer technology made within the last
four decades. Only the electronic computer allows one to perform extensive
numerical computations without error and within a reasonable amount
of time. Hence, progress in numerical analysis and progress in computer
science have always been closely interrelated in recent history.

This book will introduce the reader to the following branches of numerical 
analysis: 

Solution of systems of linear and nonlinear equations, 

Numerical solution of matrix eigenvalue problems, 

Interpolation and numerical integration, 

Numerical solution of initial and boundary value problems for differ- 
ential equations, 

Numerical solution of integral equations. 

Of course, in an introductory exposition of only about three hundred pages 
it is impossible to cover all of these areas exhaustively. Therefore, the reader 
should not expect a comprehensive treatment of all existing numerical pro- 
cedures. As already pointed out in the preface, our goal will be to guide 
the reader toward the basic ideas and questions in each of the above top- 
ics with an emphasis on the analysis and the understanding of numerical 
methods rather than merely their description. In order to achieve this, 
we will try to illustrate general principles by way of considering the main 
and most important methods, and we will leave aside discussions of more 
elaborate details of advanced methods and the consideration of lengthy 
subtleties for exceptional cases. Given the rapid development of numerical 
methods, a reasonable introduction to numerical analysis has to confine 
itself to presenting a solid foundation by restricting the presentation to the 
basic principles and procedures. 

The book includes a chapter on the necessary basic functional-analytic 
tools for the solid mathematical foundation of numerical analysis. These 
are indispensable for any deeper study and understanding of numerical 
methods, in particular for differential equations and integral equations. 

The limit of space and the taste and restrictions in experience of the 
author have caused the omission of some important topics such as linear 
and nonlinear optimization, approximation theory, and parallel computing, 
among others. On the other hand, with separate chapters on the solution 
of ill-conditioned systems of linear equations and the numerical solution 
of integral equations two topics are included that do not appear in most 
introductions to numerical analysis. They are included because of their im- 
portance and in order to indicate to the reader where the author’s mathe- 
matical research interests lie. 

A study of numerical analysis remains incomplete without the numerical
experience of individually implementing the numerical algorithms. It
is very important to build up a familiarity with numerical methods by actually
seeing the numbers working. For example, one has to complement
the theoretical understanding of the method of successive approximations 
by the experience of actually running the numerical schemes. After hav- 
ing understood the basic principles of a numerical method, it is important 
to develop the ability to actually implement the method numerically and 
work with it. In this sense the reader is encouraged to test on the computer 
numerically all of the algorithms presented in this book. 

The organization of the book is as follows. The first part of the book, 
Chapters 2 to 7, covers numerical linear algebra and is concerned with 
the solution of systems of linear and nonlinear equations. The necessary 
functional-analytic tools will be presented in Chapter 3. The second part 
of the book, Chapters 8 to 12, covers numerical analysis and is concerned 
with interpolation, numerical integration, and the numerical solution of dif- 
ferential and integral equations. At the reader’s convenience it is possible to 
study most of the second part of the book before reading the first part, with 
the exception of the chapter on functional analysis. Each chapter concludes 
with a set of problems. These are intended as exercises and applications of 
the material given in the chapter. 

The references at the end of the book are intended as a possible guide to 
some of the literature covering the topics of the individual chapters more 
exhaustively. The list of references is not meant as a bibliography on the 
vast number of introductions to numerical analysis competing with this 
book. However, we explicitly encourage the reader to explore the libraries 
and consult some of the other volumes on numerical analysis in order to 
develop a broad perspective. 




2 Linear Systems



The solution of systems of linear equations arises in various parts of mathe- 
matics and is of central importance in numerical analysis. To illustrate the 
significance of linear systems, we will start this chapter by providing some 
examples of their occurrence as part of the numerical solution of differential 
and integral equations. After seeing the examples, we will proceed with the 
solution of systems of linear equations. In principle, we have to distinguish 
between two groups of methods for the solution of linear systems: 

1. In the so-called direct methods, or elimination methods, the exact solution,
in principle, is determined through a finite number of arithmetic
operations (in real arithmetic leaving aside the influence of roundoff
errors).

2. In contrast to this, iterative methods generate a sequence of approx- 
imations to the solution by repeating the application of the same 
computational procedure at each step of the iteration. Usually, they 
are applied for large systems with special structures that ensure con- 
vergence of the successive approximations. 

A key consideration for the selection of a solution method for a linear 
system is its structure. In some problems, the matrix of the linear system 
may be a full matrix, i.e., it has few zero entries. And in other problems, 
the matrix may be very large and sparse, i.e., only a small fraction of the 
entries are different from zero. Roughly speaking, direct methods are best 
for full matrices, whereas iterative methods are best for very large and 
sparse matrices.

We will begin our treatment of linear systems by presenting the best-known
and most widely used direct method, which is attributed to Gauss,
since it is based on considerations published by Gauss in 1801 in his Dis- 
quisitiones Arithmeticae. The chapter concludes with a brief description of 
elimination by orthonormal decomposition. 

In this book, for an $m \times n$ matrix $A = (a_{jk})$, $j = 1, \ldots, m$, $k = 1, \ldots, n$,
with real or complex coefficients, $A^T$ shall always denote the transposed
matrix; i.e., $A^T$ is the $n \times m$ matrix with entries

\[ a^T_{kj} := a_{jk}, \quad k = 1, \ldots, n, \ j = 1, \ldots, m. \]

By $A^*$ we denote the adjoint of the matrix $A$; i.e., $A^* = \bar{A}^T$ is the transpose
of the matrix with complex conjugate entries. In particular, the transpose
and adjoint of a row vector are column vectors and vice versa.



2.1 Examples for Systems of Equations 

Example 2.1 We consider the discretization of the boundary value problem
for the ordinary differential equation

\[ -u''(x) = f(x, u(x)), \quad x \in [0, 1], \tag{2.1} \]

with boundary condition

\[ u(0) = u(1) = 0. \tag{2.2} \]

Here, $f : [0, 1] \times \mathbb{R} \to \mathbb{R}$ is a given continuous function, and we are looking
for a twice continuously differentiable solution $u : [0, 1] \to \mathbb{R}$. Boundary
value problems of this type occur, for example, in the mathematical treat- 
ment of vibrations of a string or a rod and in the solution of heat conduction 
problems. They often also arise in the solution of problems like the following 
Example 2.2 after applying separation of variables. The theory of ordinary 
differential equations (see [12]) provides conditions on the right-hand side $f$
of (2.1), ensuring existence and uniqueness of a solution u to the boundary 
value problem (2.1)-(2.2) (for the case of linear differential equations see 
also Chapter 11). 

For the approximate solution we choose an equidistant subdivision of the 
interval [0, 1] by setting 

\[ x_j = jh, \quad j = 0, \ldots, n + 1, \]

where the step size is given by $h = 1/(n + 1)$ with $n \in \mathbb{N}$. At the internal
grid points $x_j$, $j = 1, \ldots, n$, we replace the differential quotient in the
differential equation (2.1) by the difference quotient

\[ u''(x_j) \approx \frac{1}{h^2}\,[u(x_{j+1}) - 2u(x_j) + u(x_{j-1})] \]

to obtain the system of equations

\[ -\frac{1}{h^2}\,[u_{j-1} - 2u_j + u_{j+1}] = f(x_j, u_j), \quad j = 1, \ldots, n, \]

for approximate values $u_j$ to the exact solution $u(x_j)$. This system has to
be complemented by the two boundary conditions $u_0 = u_{n+1} = 0$. For an
abbreviated notation we introduce the $n \times n$ matrix

\[
A := \frac{1}{h^2}
\begin{pmatrix}
 2 & -1 &        &        &    \\
-1 &  2 & -1     &        &    \\
   & \ddots & \ddots & \ddots &    \\
   &        & -1     &  2     & -1 \\
   &        &        & -1     &  2
\end{pmatrix}
\]

and the vectors $U = (u_1, \ldots, u_n)^T$ and $F(U) = (f(x_1, u_1), \ldots, f(x_n, u_n))^T$.
Then our system of equations, including the boundary conditions, reads

\[ AU = F(U). \tag{2.3} \]

For obvious reasons, the above matrix $A$ is called a tridiagonal matrix, and
the vector $F$ is diagonal; i.e., the $j$th component of $F$ depends only on
the $j$th component of $U$. If (2.1) is a linear differential equation, i.e., if $f$
depends linearly on the second variable u, then the tridiagonal system of 
equations (2.3) also is linear. 

The following two questions will be addressed later in the book (see 
Chapter 11): 

1. Can we establish existence and uniqueness of a solution to the system
of equations (2.3) for sufficiently small step size $h$, provided that the
boundary value problem (2.1)-(2.2) itself is uniquely solvable?

2. How large is the error between the approximate solution $u_j$ and the
exact solution $u(x_j)$? Do we have convergence of the approximate
solution towards the exact solution as $h \to 0$?

At this point we would like only to point out that the discretization of 
boundary value problems for ordinary differential equations leads to sys- 
tems of equations with a large number of unknowns, since we expect that 
in order to achieve a reasonably accurate approximation we need to choose 
the step size h sufficiently small. □ 
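
The discretization of Example 2.1 can be tried out directly in the simplest case where $f$ does not depend on $u$. The following is a minimal sketch under stated assumptions: the function name `solve_bvp`, the storage of the matrix by its diagonals, and the test problem $-u'' = \pi^2 \sin(\pi x)$ with exact solution $u(x) = \sin(\pi x)$ are chosen here for illustration and are not part of the text. The linear system is solved by elimination adapted to the tridiagonal structure.

```python
import math


def solve_bvp(f, n):
    """Approximate -u'' = f(x), u(0) = u(1) = 0 at x_j = j*h, h = 1/(n+1)."""
    h = 1.0 / (n + 1)
    x = [j * h for j in range(1, n + 1)]       # internal grid points
    # Matrix (1/h^2) * tridiag(-1, 2, -1): keep only the diagonal entries
    # and the (constant) off-diagonal entry.
    diag = [2.0 / h**2] * n
    off = -1.0 / h**2
    rhs = [f(xj) for xj in x]
    # Forward elimination: remove the subdiagonal entry of each row.
    for j in range(1, n):
        factor = off / diag[j - 1]
        diag[j] -= factor * off
        rhs[j] -= factor * rhs[j - 1]
    # Backward substitution on the resulting upper bidiagonal system.
    u = [0.0] * n
    u[n - 1] = rhs[n - 1] / diag[n - 1]
    for j in range(n - 2, -1, -1):
        u[j] = (rhs[j] - off * u[j + 1]) / diag[j]
    return x, u


x, u = solve_bvp(lambda t: math.pi**2 * math.sin(math.pi * t), 50)
max_error = max(abs(uj - math.sin(math.pi * xj)) for xj, uj in zip(x, u))
```

Repeating the computation with a larger `n` gives a first numerical impression of the convergence question raised above.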




Example 2.2 We now consider the discretization of the boundary value 
problem for the elliptic partial differential equation 

\[ -\Delta u(x) = f(x, u(x)), \quad x \in D, \tag{2.4} \]

with Dirichlet boundary condition

\[ u(x) = 0, \quad x \in \partial D. \tag{2.5} \]

Here, $D \subset \mathbb{R}^2$ is a bounded domain, $\Delta$ denotes the Laplacian

\[ \Delta u := \frac{\partial^2 u}{\partial x_1^2} + \frac{\partial^2 u}{\partial x_2^2}, \]

$f : D \times \mathbb{R} \to \mathbb{R}$ is a given continuous function, and we are looking for
a solution $u : \bar{D} \to \mathbb{R}$ that is continuous in $\bar{D}$ and twice continuously
differentiable in $D$. Boundary value problems of this type arise, for example,
in potential theory and in heat conduction problems. The theory of elliptic 
partial differential equations (see [24]) provides conditions on the given 
function $f$ that ensure existence and uniqueness of a solution $u$.

For describing a numerical approximation method we restrict ourselves 
to the case of the square $D = (0, 1) \times (0, 1)$. We choose an equidistant
square grid with grid points

\[ x_{ij} = (ih, jh), \quad i, j = 0, \ldots, n + 1, \]

where the step size again is given by $h = 1/(n + 1)$ with $n \in \mathbb{N}$. Analogously
to the previous example, at the internal grid points $x_{ij}$, $i, j = 1, \ldots, n$, we
replace the Laplacian by the Laplace difference operator

\[ \Delta u(x_{ij}) \approx \frac{1}{h^2}\,[u(x_{i+1,j}) + u(x_{i-1,j}) + u(x_{i,j+1}) + u(x_{i,j-1}) - 4u(x_{ij})]. \]

Obviously, for each grid point $x_{ij}$ this difference operator has nonvanishing
weights only at $x_{ij}$ itself and at the four neighboring points on the vertical and horizontal
line through $x_{ij}$. This observation also illustrates why the set of grid points
with nonvanishing weights is called the star associated with the Laplace
difference operator. Using this difference approximation leads to the system
of equations

\[ -\frac{1}{h^2}\,[u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} - 4u_{ij}] = f(x_{ij}, u_{ij}), \quad i, j = 1, \ldots, n, \]

for approximate values $u_{ij}$ to the exact solution $u(x_{ij})$. This system has to
be complemented by the boundary conditions

\[ u_{0,j} = u_{n+1,j} = 0, \quad j = 0, \ldots, n + 1, \]

at the grid points on the vertical parts and

\[ u_{i,0} = u_{i,n+1} = 0, \quad i = 1, \ldots, n, \]

at the grid points on the horizontal parts of the boundary dD. In order to 
write this system in matrix form we rearrange the unknowns by ordering 
them row by row and setting 



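\[ u_1 = u_{11}, \ u_2 = u_{21}, \ \ldots, \ u_n = u_{n1}, \ u_{n+1} = u_{12}, \ \ldots, \ u_m = u_{nn}, \]

where $m = n^2$. Furthermore, we introduce an $m \times m$ matrix $A$ in the form
of an $n \times n$ block tridiagonal matrix

\[
A := \frac{1}{h^2}
\begin{pmatrix}
 B & -I &        &        &    \\
-I &  B & -I     &        &    \\
   & \ddots & \ddots & \ddots &    \\
   &        & -I     &  B     & -I \\
   &        &        & -I     &  B
\end{pmatrix},
\]

where $I$ denotes the $n \times n$ identity matrix and $B$ is the $n \times n$ tridiagonal
matrix

\[
B :=
\begin{pmatrix}
 4 & -1 &        &        &    \\
-1 &  4 & -1     &        &    \\
   & \ddots & \ddots & \ddots &    \\
   &        & -1     &  4     & -1 \\
   &        &        & -1     &  4
\end{pmatrix}.
\]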
After introducing the vectors U and F(U) analogously to Example 2.1, we 
can rewrite the system of equations in the short form 

AU = F(U), (2.6) 

which also includes the boundary conditions. 

Again we postpone the questions of unique solvability of the system (2.6) 
and the problem of convergence and error estimates for later parts of the 
book (see Chapter 11). Here, we conclude the example with the observation 
that the system has $n^2$ unknowns, where $n$ will be fairly large if the step
size h is sufficiently small in order to achieve a reasonably accurate approxi- 
mation to the solution of the boundary value problem. These large systems 
of equations arising in the discretization of partial differential equations 
call for efficient solution methods. □ 
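
The structure of the matrix in Example 2.2 can be inspected by assembling it for a small grid. The sketch below is an illustration under stated assumptions: the function name `laplace_matrix` and the dense list-of-lists storage are chosen here for readability, and the entries follow from the five-point difference equations with the unknowns ordered row by row. In practice one would of course not store the zero entries.

```python
def laplace_matrix(n):
    """Dense m x m matrix (m = n*n) of the five-point scheme on the unit square.

    The unknown u_ij (i, j = 1..n) is stored at position (j-1)*n + (i-1),
    i.e., the grid is traversed row by row.
    """
    h = 1.0 / (n + 1)
    m = n * n
    A = [[0.0] * m for _ in range(m)]
    for j in range(n):            # index of the grid row (block index)
        for i in range(n):        # position within the grid row
            p = j * n + i
            A[p][p] = 4.0 / h**2              # centre of the star
            if i > 0:
                A[p][p - 1] = -1.0 / h**2     # left neighbour
            if i < n - 1:
                A[p][p + 1] = -1.0 / h**2     # right neighbour
            if j > 0:
                A[p][p - n] = -1.0 / h**2     # neighbour in the row below
            if j < n - 1:
                A[p][p + n] = -1.0 / h**2     # neighbour in the row above
    return A


A = laplace_matrix(3)
nonzeros = sum(1 for row in A for entry in row if entry != 0.0)
```

Already for this small grid most of the entries vanish, and the share of nonzero entries drops further as `n` grows, since each row contains at most five of them.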

Example 2.3 Consider the linear integral equation

\[ \varphi(x) - \int_0^1 K(x, y)\,\varphi(y)\,dy = f(x), \quad x \in [0, 1], \]

where $K : [0, 1] \times [0, 1] \to \mathbb{R}$ and $f : [0, 1] \to \mathbb{R}$ are given continuous functions
and where we seek a continuous solution $\varphi : [0, 1] \to \mathbb{R}$. Such integral
equations either arise directly in the solution of applied problems, or more
often they occur indirectly in the solution of boundary value problems for
differential equations. If the homogeneous form of this equation, i.e., the
integral equation with the right-hand side $f = 0$, admits only the trivial
solution $\varphi = 0$, then for each $f$ the inhomogeneous integral equation has a
unique solution $\varphi$ (see Chapter 12).

For the numerical approximation we replace the integral by the rectangular
sum

\[ \int_0^1 K(x, y)\,\varphi(y)\,dy \approx \frac{1}{n} \sum_{k=1}^{n} K(x, x_k)\,\varphi(x_k) \]

with equidistant grid points $x_k = k/n$, $k = 1, \ldots, n$. If we require the
approximated equation to be satisfied only at the grid points, we arrive at
the system of linear equations

\[ \varphi_j - \frac{1}{n} \sum_{k=1}^{n} K(x_j, x_k)\,\varphi_k = f(x_j), \quad j = 1, \ldots, n, \]

for approximate values $\varphi_j$ to the exact solution $\varphi(x_j)$. As in the preceding
examples, we postpone the question of unique solvability of the linear
system and the convergence and error analysis (see Chapter 12). □ 
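
The linear system of Example 2.3 is easily set up and solved for a concrete kernel. The following is a minimal sketch under stated assumptions: the helper names `solve_dense` and `solve_integral_equation` are hypothetical, the dense solver is a plain elimination with row interchanges included only to keep the example self-contained, and the test data $K(x, y) = xy$, $f(x) = 2x/3$ with exact solution $\varphi(x) = x$ are chosen here for illustration.

```python
def solve_dense(A, y):
    """Solve A x = y by elimination with row interchanges (works on copies)."""
    n = len(y)
    A = [row[:] for row in A]
    y = y[:]
    for m in range(n - 1):
        # Pick the row with the largest entry in column m as elimination row.
        p = max(range(m, n), key=lambda r: abs(A[r][m]))
        A[m], A[p] = A[p], A[m]
        y[m], y[p] = y[p], y[m]
        for j in range(m + 1, n):
            factor = A[j][m] / A[m][m]
            for k in range(m + 1, n):
                A[j][k] -= factor * A[m][k]
            y[j] -= factor * y[m]
    x = [0.0] * n
    for m in range(n - 1, -1, -1):
        s = y[m] - sum(A[m][k] * x[k] for k in range(m + 1, n))
        x[m] = s / A[m][m]
    return x


def solve_integral_equation(K, f, n):
    """Rectangular-rule discretization at the grid points x_k = k/n."""
    x = [k / n for k in range(1, n + 1)]
    # Matrix entries: delta_jk - K(x_j, x_k) / n. This is a full matrix.
    A = [[(1.0 if j == k else 0.0) - K(x[j], x[k]) / n for k in range(n)]
         for j in range(n)]
    rhs = [f(xj) for xj in x]
    return x, solve_dense(A, rhs)


def max_error(n):
    x, phi = solve_integral_equation(lambda s, t: s * t,
                                     lambda s: 2.0 * s / 3.0, n)
    return max(abs(pj - xj) for xj, pj in zip(x, phi))


err_coarse, err_fine = max_error(40), max_error(80)
```

In contrast to the difference schemes of the preceding examples, every matrix entry is in general different from zero here.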

Example 2.4 In this last example we will briefly touch on the method of 
least squares. Consider some (physical) quantity u depending on time t and 
a parameter vector $a = (a_1, \ldots, a_n)^T \in \mathbb{R}^n$ in terms of a known function

\[ u(t) = f(t; a). \]

In order to determine the values of the parameter $a$ (representing some
physical constants), one can take $m$ measurements of $u$ at different times
$t_1, \ldots, t_m$ and then try to find $a$ by solving the system of equations

\[ u(t_j) = f(t_j; a), \quad j = 1, \ldots, m. \]

If $m = n$, this system consists of $n$ equations for the $n$ unknowns $a_1, \ldots, a_n$.
However, in general, the measurements will be contaminated by errors. 
Therefore, usually one will take m > n measurements and then will try to 
determine a by requiring the deviations 



\[ u(t_j) - f(t_j; a), \quad j = 1, \ldots, m, \]

to be as small as possible. Usually the latter requirement is posed in the
least squares sense, i.e., the parameter $a$ is chosen such that

\[ g(a) := \sum_{k=1}^{m} [u(t_k) - f(t_k; a)]^2 \]

attains a minimal value. The necessary conditions for a minimum,

\[ \frac{\partial g}{\partial a_j} = 0, \quad j = 1, \ldots, n, \]

lead to the normal equations

\[ \sum_{k=1}^{m} [u(t_k) - f(t_k; a)]\,\frac{\partial f(t_k; a)}{\partial a_j} = 0, \quad j = 1, \ldots, n, \]

for the method of least squares. These constitute a system of $n$, in general,
nonlinear equations for the $n$ unknowns $a_1, \ldots, a_n$. □
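
In the special case where $f$ depends linearly on the parameters, the normal equations form a linear system. The sketch below illustrates this for the straight-line model $f(t; a) = a_1 + a_2 t$ with $n = 2$. The model, the function name `fit_line`, and the measurement values are assumptions made for this illustration only.

```python
def fit_line(t, u):
    """Least squares fit of u ~ a1 + a2*t via the 2 x 2 normal equations.

    With df/da1 = 1 and df/da2 = t the normal equations read
        m*a1        + (sum t)*a2   = sum u
        (sum t)*a1  + (sum t^2)*a2 = sum t*u .
    """
    m = len(t)
    s_t = sum(t)
    s_tt = sum(tk * tk for tk in t)
    s_u = sum(u)
    s_tu = sum(tk * uk for tk, uk in zip(t, u))
    det = m * s_tt - s_t * s_t        # nonzero if the t_k are not all equal
    a1 = (s_u * s_tt - s_t * s_tu) / det
    a2 = (m * s_tu - s_t * s_u) / det
    return a1, a2


# m = 4 hypothetical measurements for n = 2 unknown parameters.
times = [0.0, 1.0, 2.0, 3.0]
values = [1.1, 2.9, 5.2, 6.8]
a1, a2 = fit_line(times, values)
residuals = [uk - (a1 + a2 * tk) for tk, uk in zip(times, values)]
```

The deviations in `residuals` do not vanish individually, but they satisfy the two normal equations, i.e., their sum and their sum weighted by the times are zero up to roundoff.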

At this point, the reader should be convinced of the need for effective 
methods for solving large systems of linear and nonlinear equations and be 
willing to be introduced to such methods in the subsequent chapters. We 
also wish to note that the discretization of differential equations leads to 
sparse matrices, whereas for the least squares problem and the discretiza- 
tion of integral equations one is faced with full matrices. 



2.2 Gaussian Elimination 

We proceed with describing the Gaussian elimination method for a system 
of linear equations 

\[ Ax = y. \]

Here $A$ is a given $n \times n$ matrix $A = (a_{jk})$ with real (or complex) entries, $y$
a given right-hand side $y = (y_1, \ldots, y_n)^T \in \mathbb{R}^n$ (or $\mathbb{C}^n$), and we are looking
for a solution vector $x = (x_1, \ldots, x_n)^T \in \mathbb{R}^n$ (or $\mathbb{C}^n$). More explicitly, our
system of equations can be written in the form

\[ \sum_{k=1}^{n} a_{jk} x_k = y_j, \quad j = 1, \ldots, n, \]

that is,

\[
\begin{aligned}
a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n &= y_1 \\
a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n &= y_2 \\
&\ \,\vdots \\
a_{n1} x_1 + a_{n2} x_2 + \cdots + a_{nn} x_n &= y_n.
\end{aligned}
\]

Assuming that the reader is familiar with basic linear algebra, we recall the 
following various ways of saying that the matrix A is nonsingular: 

1. The inverse matrix $A^{-1}$ exists.

2. For each $y$ the linear system $Ax = y$ has a unique solution.

3. The homogeneous system $Ax = 0$ has only the trivial solution.

4. The determinant of $A$ satisfies $\det A \neq 0$.

5. The rows (columns) of A are linearly independent. 

The very basic idea of the Gaussian elimination method is to use the first
equation to eliminate the first unknown from the last $n - 1$ equations, then
use the new second equation to eliminate the second unknown from the last
$n - 2$ equations, etc. This way, by $n - 1$ such eliminations the given linear
system is transformed into an equivalent linear system that is of triangular
form

\[
\begin{aligned}
b_{11} x_1 + b_{12} x_2 + \cdots + b_{1n} x_n &= z_1 \\
b_{22} x_2 + \cdots + b_{2n} x_n &= z_2 \\
&\ \,\vdots \\
b_{n-1,n-1} x_{n-1} + b_{n-1,n} x_n &= z_{n-1} \\
b_{nn} x_n &= z_n.
\end{aligned}
\]

Recall that two linear systems are called equivalent if every solution of one
is a solution of the other. The triangular system can be solved recursively
by first obtaining $x_n$ from the last equation, then obtaining $x_{n-1}$ from the
second to last equation, etc. This procedure is known as backward substitution.
Explicitly, it is described by $x_n = z_n / b_{nn}$ and

\[ x_m = \frac{1}{b_{mm}} \Bigl( z_m - \sum_{k=m+1}^{n} b_{mk} x_k \Bigr), \quad m = n - 1, n - 2, \ldots, 1. \]
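
The backward substitution formula translates almost literally into a short routine. The following is a minimal sketch, assuming the triangular matrix is stored as a list of rows with nonvanishing diagonal entries; the function name `back_substitute` and the example data are illustrative assumptions. Indices are shifted by one because Python counts from zero.

```python
def back_substitute(B, z):
    """Solve the upper triangular system B x = z from the last row upwards."""
    n = len(z)
    x = [0.0] * n
    for m in range(n - 1, -1, -1):
        s = z[m]
        for k in range(m + 1, n):
            s -= B[m][k] * x[k]       # subtract the already known unknowns
        x[m] = s / B[m][m]            # requires b_mm != 0
    return x


B = [[2.0, 1.0, -1.0],
     [0.0, 3.0, 2.0],
     [0.0, 0.0, 4.0]]
z = [1.0, 12.0, 12.0]
x = back_substitute(B, z)             # solution of this example: (1, 2, 3)
```

The entries of `B` below the diagonal are never read, so the routine works unchanged if those positions still hold leftover values.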

We begin by considering a nonsingular matrix $A$. To eliminate the unknown
$x_1$, for $j = 2, \ldots, n$ we multiply the first equation by $a_{j1}/a_{11}$ and
subtract the result from the $j$th equation. For this we have to require that
$a_{11} \neq 0$. Since we assume the matrix to be nonsingular, this can be achieved
by reordering the rows or the columns of the given system. This procedure
leads to a system of the form

\[
\begin{aligned}
b_{11} x_1 + b_{12} x_2 + \cdots + b_{1n} x_n &= z_1 \\
a_{22}^{(2)} x_2 + \cdots + a_{2n}^{(2)} x_n &= y_2^{(2)} \\
&\ \,\vdots \\
a_{n2}^{(2)} x_2 + \cdots + a_{nn}^{(2)} x_n &= y_n^{(2)}
\end{aligned}
\]



with the new coefficients given by

\[ b_{1k} := a_{1k}^{(1)}, \quad k = 1, \ldots, n, \qquad a_{jk}^{(2)} := a_{jk}^{(1)} - \frac{a_{j1}^{(1)} a_{1k}^{(1)}}{a_{11}^{(1)}}, \quad j, k = 2, \ldots, n, \]

and the new right-hand sides given by

\[ z_1 := y_1^{(1)}, \qquad y_j^{(2)} := y_j^{(1)} - \frac{a_{j1}^{(1)} y_1^{(1)}}{a_{11}^{(1)}}, \quad j = 2, \ldots, n. \]

Here, for the coefficients and right-hand sides of the original system we
have set $a_{jk}^{(1)} := a_{jk}$ and $y_j^{(1)} := y_j$.

Proceeding in this way, the given $n \times n$ system for the unknowns $x_1, \ldots, x_n$
is equivalently transformed into an $(n - 1) \times (n - 1)$ system for the unknowns
$x_2, \ldots, x_n$. Adding a multiple of one row of a matrix to another row does
not change the value of its determinant. Therefore, in the above elimination
the determinant of the system remains the same (with the exception
of a possible change of its sign if the order of rows or columns is changed).
Hence, the resulting $(n - 1) \times (n - 1)$ system for $x_2, \ldots, x_n$ again has a
nonvanishing determinant, and we can apply precisely the same procedure
to eliminate the second unknown $x_2$ from the remaining $(n - 1) \times (n - 1)$
system.

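By repeating this process we complete the forward elimination, by which
the system of linear equations

\[
\begin{aligned}
a_{11}^{(1)} x_1 + a_{12}^{(1)} x_2 + \cdots + a_{1n}^{(1)} x_n &= y_1^{(1)} \\
a_{21}^{(1)} x_1 + a_{22}^{(1)} x_2 + \cdots + a_{2n}^{(1)} x_n &= y_2^{(1)} \\
&\ \,\vdots \\
a_{n1}^{(1)} x_1 + a_{n2}^{(1)} x_2 + \cdots + a_{nn}^{(1)} x_n &= y_n^{(1)}
\end{aligned}
\]

with a nonsingular matrix $A = (a_{jk}^{(1)})$ is equivalently transformed into a
triangular system

\[
\begin{aligned}
b_{11} x_1 + b_{12} x_2 + \cdots + b_{1n} x_n &= z_1 \\
b_{22} x_2 + \cdots + b_{2n} x_n &= z_2 \\
&\ \,\vdots \\
b_{n-1,n-1} x_{n-1} + b_{n-1,n} x_n &= z_{n-1} \\
b_{nn} x_n &= z_n
\end{aligned}
\]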



by $n - 1$ recursive elimination steps of the form

\[ a_{jk}^{(m+1)} := a_{jk}^{(m)} - \frac{a_{jm}^{(m)} a_{mk}^{(m)}}{a_{mm}^{(m)}}, \quad j, k = m + 1, \ldots, n, \]

\[ y_j^{(m+1)} := y_j^{(m)} - \frac{a_{jm}^{(m)} y_m^{(m)}}{a_{mm}^{(m)}}, \quad j = m + 1, \ldots, n, \]

for $m = 1, \ldots, n - 1$. The coefficients and the right-hand sides of the final triangular system are
given by

\[ b_{jk} := a_{jk}^{(j)}, \quad k = j, \ldots, n, \ j = 1, \ldots, n, \]

and

\[ z_j := y_j^{(j)}, \quad j = 1, \ldots, n. \]

The condition a^l ^ 0, which is necessary for performing the algorithm, 
always can be achieved by a reordering of the rows or columns, since oth- 
erwise the matrix A would not be nonsingular. 

We would like to compress the operations of one elimination step into
the following scheme

\[
\begin{array}{c|c}
a & b \\ \hline
c & d
\end{array}
\]

where the rectangle illustrates the remaining part of the matrix and the
right-hand side for which the elimination has to be performed. Here, $a$
stands for the elimination element, or pivot element; the elements $b$ in
the elimination row remain unchanged; the elements $c$ of the elimination
column are replaced by zero (with the exception of the pivot element $a$);
and the remaining elements $d$ are changed according to the rule

\[
d \to d - \frac{bc}{a}.
\]

We note that in computer calculations, of course, the new values for the 
coefficients of the matrix and the right-hand sides can be stored in the 
locations held by the old values. 

More explicitly, the entire Gaussian elimination can be written in the 
following algorithmic form. 

Algorithm 2.5 (Gaussian elimination)

1. Forward elimination:

   For $m = 1, \ldots, n-1$ do
      for $j = m+1, \ldots, n$ do
         for $k = m+1, \ldots, n$ do $a_{jk} := a_{jk} - \dfrac{a_{jm} a_{mk}}{a_{mm}}$
         $y_j := y_j - \dfrac{a_{jm} y_m}{a_{mm}}$

2. Backward substitution:

   For $m = n, n-1, \ldots, 1$ do
      $x_m := y_m$
      for $k = m+1, \ldots, n$ do $x_m := x_m - a_{mk} x_k$
      $x_m := \dfrac{x_m}{a_{mm}}$
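Algorithm 2.5 can be transcribed almost line by line into a program. The following is a minimal sketch in Python, assuming real coefficients stored as nested lists and assuming that every pivot encountered is nonzero (no reordering is attempted); the function name `gaussian_elimination` and the test system are illustrative choices, not part of the algorithm as stated.

```python
def gaussian_elimination(a, y):
    """Solve a x = y by Algorithm 2.5 (forward elimination, back substitution).

    Assumption of this sketch: every pivot a[m][m] is nonzero, so no
    reordering of rows or columns is needed.
    """
    n = len(a)
    a = [row[:] for row in a]  # work on copies so the inputs stay intact
    y = y[:]

    # 1. Forward elimination (indices are zero-based here)
    for m in range(n - 1):
        for j in range(m + 1, n):
            factor = a[j][m] / a[m][m]
            for k in range(m + 1, n):
                a[j][k] -= factor * a[m][k]
            y[j] -= factor * y[m]

    # 2. Backward substitution
    x = [0.0] * n
    for m in range(n - 1, -1, -1):
        s = y[m]
        for k in range(m + 1, n):
            s -= a[m][k] * x[k]
        x[m] = s / a[m][m]
    return x


# Illustrative system: 2 x1 + x2 = 3, x1 + 3 x2 = 5
x = gaussian_elimination([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0])
print(x)
```

As in the remark on storage above, the updated coefficients simply overwrite the old ones; the entries below the diagonal are never set to zero explicitly because the back substitution does not read them.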

If the matrix $A$ is singular and has rank $r$, the elimination procedure
will terminate after $r$ steps. The matrix of the remaining $(n-r) \times (n-r)$
system for the unknowns $x_{r+1}, \ldots, x_n$ is the zero matrix, because otherwise
the rank of $A$ would be different from $r$. Hence, in this case the given linear
system is solvable if and only if the right-hand sides after $r$ elimination
steps satisfy

\[
z_{r+1} = \cdots = z_n = 0.
\]

The solutions can be found from the triangular system by arbitrarily choosing
$x_{r+1}, \ldots, x_n$ and then recursively determining $x_r, \ldots, x_1$. This way we
obtain the $(n-r)$-dimensional solution manifold.

In order to control the influence of roundoff errors we want to keep the
quotient $a_{jm}^{(m)}/a_{mm}^{(m)}$ small; i.e., we want to have a large pivot element $a_{mm}^{(m)}$.
Therefore, instead of only requiring $a_{mm}^{(m)} \neq 0$, in practice, either complete
pivoting or partial row or column pivoting is employed. For complete pivoting,
both the rows and the columns are reordered such that $a_{mm}^{(m)}$ has
maximal absolute value in the $(n-m+1) \times (n-m+1)$ matrix remaining
for the $m$th forward elimination step. In order to minimize the additional
computational cost caused by pivoting, for row (or column) pivoting the
rows (or columns) are reordered such that $a_{mm}^{(m)}$ has maximal absolute value
in the elimination column (or row), i.e., in the $m$th column (or row). Of
course, in the actual implementation of the Gaussian elimination algorithm
the reordering of rows and columns need not be done explicitly. Instead,
the interchange may be done only implicitly by leaving the pivot element
at its original location and keeping track of the interchange of rows and
columns through the associated permutation matrix.
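Row pivoting adds only a search and an interchange to each elimination step. The sketch below is one possible Python realization under simplifying assumptions: the rows are swapped explicitly rather than tracked through a permutation, the data are real, and the name `solve_with_row_pivoting` is hypothetical.

```python
def solve_with_row_pivoting(a, y):
    """Gaussian elimination with row pivoting (explicit row interchanges)."""
    n = len(a)
    a = [row[:] for row in a]
    y = y[:]

    for m in range(n - 1):
        # Row index of the entry of largest absolute value in column m,
        # searched among the rows m, ..., n-1 that are still to be processed.
        p = max(range(m, n), key=lambda j: abs(a[j][m]))
        if a[p][m] == 0.0:
            raise ValueError("matrix is singular")
        if p != m:
            a[m], a[p] = a[p], a[m]
            y[m], y[p] = y[p], y[m]
        for j in range(m + 1, n):
            factor = a[j][m] / a[m][m]
            for k in range(m + 1, n):
                a[j][k] -= factor * a[m][k]
            y[j] -= factor * y[m]

    if a[n - 1][n - 1] == 0.0:
        raise ValueError("matrix is singular")

    x = [0.0] * n
    for m in range(n - 1, -1, -1):
        s = y[m]
        for k in range(m + 1, n):
            s -= a[m][k] * x[k]
        x[m] = s / a[m][m]
    return x


# The leading coefficient is zero, so elimination without reordering would fail.
x = solve_with_row_pivoting([[0.0, 1.0], [1.0, 1.0]], [1.0, 2.0])
print(x)
```

An implementation that avoids moving data would keep an index vector of row positions instead of swapping the rows themselves; the arithmetic is the same.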

The following example illustrates that partial pivoting does not always 
prevent loss of accuracy in the numerical computations. 

Example 2.6 We consider the system

\[
\begin{aligned}
x_1 + 200 x_2 &= 100 \\
x_1 + x_2 &= 1
\end{aligned}
\]

with the exact solution $x_1 = 100/199 = 0.502\ldots$, $x_2 = 99/199 = 0.497\ldots$.
For the following computations we use two-decimal-digit floating-point
arithmetic. Column pivoting leads to $a_{11}$ as pivot element, and the elimination
yields

\[
\begin{aligned}
x_1 + 200 x_2 &= 100 \\
-200 x_2 &= -99,
\end{aligned}
\]

since $199 = 200$ in two-digit floating-point representation. From the second
equation we then have $x_2 = 0.50$ ($0.495 = 0.50$ in two decimal digits), and
from the first equation it finally follows that $x_1 = 0$.

However, if by complete pivoting we choose $a_{12}$ as pivot element, the
elimination leads to

\[
\begin{aligned}
x_1 + 200 x_2 &= 100 \\
x_1 &= 0.5
\end{aligned}
\]

($0.995 = 1.00$ in two decimal digits), and from this we get the solution
$x_1 = 0.5$, $x_2 = 0.5$ ($0.4975 = 0.50$ in two decimal digits), which is correct
to two decimal digits. □
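The two computations of Example 2.6 can be replayed mechanically. The sketch below uses Python's `decimal` module with a precision of two significant digits, so that the result of every single operation is rounded; the helper `solve_2x2` and the choice of round-half-up are assumptions made for this illustration.

```python
from decimal import Decimal, Context, ROUND_HALF_UP, localcontext

# Every arithmetic operation is rounded to two significant decimal digits.
TWO_DIGITS = Context(prec=2, rounding=ROUND_HALF_UP)


def solve_2x2(a, y, pivot_row, pivot_col):
    """Solve a 2 x 2 system using a[pivot_row][pivot_col] as pivot element."""
    with localcontext(TWO_DIGITS):
        r, c = pivot_row, pivot_col
        r2, c2 = 1 - r, 1 - c              # the other row and the other column
        factor = a[r2][c] / a[r][c]
        reduced = a[r2][c2] - factor * a[r][c2]
        rhs = y[r2] - factor * y[r]
        x = [None, None]
        x[c2] = rhs / reduced              # unknown left in the reduced equation
        x[c] = (y[r] - a[r][c2] * x[c2]) / a[r][c]
        return x


a = [[Decimal(1), Decimal(200)], [Decimal(1), Decimal(1)]]
y = [Decimal(100), Decimal(1)]

column_pivot = solve_2x2(a, y, 0, 0)    # pivot a_11
complete_pivot = solve_2x2(a, y, 0, 1)  # pivot a_12
print(column_pivot, complete_pivot)
```

With the pivot $a_{11}$ the computed $x_1$ collapses to zero, whereas the pivot $a_{12}$ reproduces $x_1 = x_2 = 0.5$, in agreement with the example.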



Since complete pivoting is more costly than partial pivoting, in practical
computations one can try to overcome the disadvantages of partial pivoting
by scaling the matrix. This means that if $B = D_1 A D_2$, in order to obtain
the solution $x$ of $Ax = y$ we first solve $Bz = D_1 y$ for $z$ and then determine
$x$ from $x = D_2 z$. Here $D_1$ and $D_2$ are some diagonal matrices chosen such
that for the matrix $B$ the row and column sums of the absolute values are
approximately equal. A diagonal matrix $D = (d_{jk})$ is a matrix with the
off-diagonal elements equal to zero; i.e., $d_{jk} = 0$ for $j \neq k$. For a detailed
discussion of scaling we refer to [27]. Unfortunately, there is no known
general procedure for such scaling, i.e., for choosing the diagonal matrices
$D_1$ and $D_2$.

For an estimate of the computational cost of Gaussian elimination we
perform a count of the number of multiplications. By $\alpha_n$ we denote the
number of multiplications that are required for solving a triangular $n \times n$
system by back substitution. Obviously, for $\alpha_n$ we have the recurrence
relation

\[
\alpha_n = \alpha_{n-1} + n,
\]

since we need $n$ multiplications to obtain $x_1$ from the first equation after
having already determined $x_2, \ldots, x_n$. Hence, we have

\[
\alpha_n = \sum_{k=1}^{n} k = \frac{n(n+1)}{2},
\]

since $\alpha_1 = 1$. By $\beta_{n,r}$ we denote the number of multiplications needed
for the forward elimination simultaneously for $r$ different right-hand sides.
Here we have the recurrence relation

\[
\beta_{n,r} = \beta_{n-1,r} + (n + r)(n - 1),
\]

since the elimination of the unknown $x_1$ requires $n + r$ multiplications for
each row of the $n - 1$ rows. From this it follows that

\[
\beta_{n,r} = \sum_{k=1}^{n} (k + r)(k - 1) = \frac{n(n^2 - 1)}{3} + \frac{n(n-1)r}{2},
\]

because $\beta_{1,r} = 0$. Adding $r\alpha_n$ and $\beta_{n,r}$ we obtain the following result.

Theorem 2.7 Gaussian elimination for the simultaneous solution of an
$n \times n$ system for $r$ different right-hand sides requires a total of

\[
\frac{n(n^2 - 1)}{3} + r n^2
\]

multiplications.
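The count in Theorem 2.7 can be checked by walking through the loops of Algorithm 2.5 and tallying operations instead of performing them. The sketch below assumes, as the recurrences above suggest, that each division is counted as one multiplication; the function names are illustrative.

```python
def count_multiplications(n, r):
    """Tally the multiplications (divisions included) for r right-hand sides."""
    count = 0
    # Forward elimination: step m removes the unknown x_m from rows m+1, ..., n.
    for m in range(1, n):
        for j in range(m + 1, n + 1):
            count += 1      # the quotient a_jm / a_mm
            count += n - m  # updates of a_jk for k = m+1, ..., n
            count += r      # updates of the r right-hand sides
    # Back substitution, carried out once per right-hand side.
    for _ in range(r):
        for m in range(n, 0, -1):
            count += n - m  # products a_mk * x_k for k = m+1, ..., n
            count += 1      # division by a_mm
    return count


def theorem_2_7(n, r):
    """Closed form n (n^2 - 1) / 3 + r n^2; the first term is always an integer."""
    return n * (n * n - 1) // 3 + r * n * n


for n in range(1, 8):
    for r in range(1, 4):
        assert count_multiplications(n, r) == theorem_2_7(n, r)
print(count_multiplications(5, 2))
```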

The computational cost, counting only the multiplications, in Gaussian
elimination is $n^3/3 + O(n^2)$. It is left to the reader to show that the number
of additions is also $n^3/3 + O(n^2)$ (see Problem 2.7). Doubling the number
of unknowns increases the computation time by a factor of eight. Assuming
$1\,\mu\text{sec} = 10^{-6}$ sec per addition and multiplication, i.e., on a computer with
one million floating point operations per second, the solution of a system
with $n = 10^3$ requires approximately ten minutes, and with $n = 10^4$ it
requires approximately six days. This illustrates dramatically that for the
solution of large linear systems iterative methods, which we will study in
Chapter 4, are better suited than direct methods. Row or column pivoting
leads to an additional cost proportional to $n^2$, whereas complete pivoting
adds costs proportional to $n^3$. For the latter reason, complete pivoting is
used only rarely in practical computations.

The Gaussian algorithm also allows the computation of the determinant
and the inverse of a matrix $A$. The determinant $\det A$ is simply given by the
product of the diagonal elements in the triangular matrix obtained through
the elimination procedure. If the determinant is computed using expansions
by submatrices, then the operational count is $n!$ multiplications, as compared
to $n^3/3$ for Gaussian elimination. This illustrates why Cramer's rule
for the solution of linear systems is only a theoretical mathematical tool
and not a tool for practical computations.
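As an illustration of the determinant computation, the following Python sketch performs the forward elimination with row interchanges and multiplies the diagonal elements of the resulting triangular matrix, changing the sign once per interchange. It is a sketch for real matrices stored as nested lists; the name `determinant` is an illustrative choice.

```python
def determinant(a):
    """Determinant via forward elimination with row interchanges."""
    n = len(a)
    a = [row[:] for row in a]
    sign = 1.0
    for m in range(n - 1):
        p = max(range(m, n), key=lambda j: abs(a[j][m]))
        if a[p][m] == 0.0:
            return 0.0        # no nonzero pivot available: the matrix is singular
        if p != m:
            a[m], a[p] = a[p], a[m]
            sign = -sign      # each row interchange changes the sign
        for j in range(m + 1, n):
            factor = a[j][m] / a[m][m]
            for k in range(m + 1, n):
                a[j][k] -= factor * a[m][k]
    det = sign
    for m in range(n):
        det *= a[m][m]        # product of the diagonal of the triangular matrix
    return det


d1 = determinant([[2.0, 1.0], [1.0, 3.0]])
d2 = determinant([[0.0, 1.0], [1.0, 0.0]])
print(d1, d2)
```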

The inverse of a matrix is obtained by solving the linear system simultaneously
for the $n$ right-hand sides given by the columns of the identity
matrix, i.e., by solving the $n$ systems

\[
A x_i = e_i, \quad i = 1, \ldots, n,
\]

where $e_i$ is the $i$th column of the identity matrix. Then the $n$ solutions
$x_1, \ldots, x_n$ will provide the columns of the inverse matrix $A^{-1}$. We would
like to stress that one does not want to solve a system $Ax = y$ by first
computing $A^{-1}$ and then evaluating $x = A^{-1} y$, since this generally leads
to considerably higher computational costs.

The Gauss-Jordan method is an elimination algorithm that in each step
eliminates the unknown both above and below the diagonal. The complete
elimination procedure transforms the system equivalently into a diagonal
system. The multiplication count shows a computational cost of
order $n^3/2 + O(n^2)$, i.e., an increase of 50 percent over Gaussian elimination.
Hence, the Gauss-Jordan method is rarely used in applications. For
details we refer to [26, 27].



2.3 LR Decomposition 



In the sequel we will indicate how Gaussian elimination provides an LR 
decomposition (or factorization ) of a given matrix. 

Definition 2.8 A factorization of a matrix A into a product 

A = LR 

of a lower (left) triangular matrix L and an upper (right) triangular matrix 
R is called an LR decomposition of A. 

A matrix $A = (a_{jk})$ is called lower triangular or left triangular if $a_{jk} = 0$
for $j < k$; it is called upper triangular or right triangular if $a_{jk} = 0$ for
$j > k$. The product of two lower (upper) triangular matrices again is lower
(upper) triangular, lower (upper) triangular matrices with nonvanishing
diagonal elements are nonsingular, and the inverse matrix of a lower (upper)
triangular matrix again is lower (upper) triangular (see Problem 2.14).

Theorem 2.9 For a nonsingular matrix A, Gaussian elimination (without 
reordering rows and columns) yields an LR decomposition. 

Proof. In the first elimination step we multiply the first equation by $a_{j1}/a_{11}$
and subtract the result from the $j$th equation; i.e., the matrix $A_1 = A$ is
multiplied from the left by the lower triangular matrix

\[
L_1 = \begin{pmatrix}
1 & & & \\
-\dfrac{a_{21}}{a_{11}} & 1 & & \\
\vdots & & \ddots & \\
-\dfrac{a_{n1}}{a_{11}} & & & 1
\end{pmatrix}.
\]

The resulting matrix $A_2 = L_1 A_1$ is of the form

\[
A_2 = \begin{pmatrix} a_{11} & * \\ 0 & A_{n-1} \end{pmatrix},
\]

where $A_{n-1}$ is an $(n-1) \times (n-1)$ matrix. In the second step the same procedure
is repeated for the $(n-1) \times (n-1)$ matrix $A_{n-1}$. The corresponding
$(n-1) \times (n-1)$ elimination matrix is completed as an $n \times n$ triangular
matrix $L_2$ by setting the diagonal element in the first row equal to one. In
this way, $n - 1$ elimination steps lead to

\[
L_{n-1} \cdots L_1 A = R,
\]

with nonsingular lower triangular matrices $L_1, \ldots, L_{n-1}$ and an upper triangular
matrix $R$. From this we find

\[
A = LR,
\]

where $L$ denotes the inverse of the product $L_{n-1} \cdots L_1$. □
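In a program the factor $L$ need not be formed by inverting a product. It is a standard fact that the inverse of $L_{n-1} \cdots L_1$ carries the elimination quotients $a_{jm}^{(m)}/a_{mm}^{(m)}$ below a unit diagonal, so they can be stored as they arise. The Python sketch below relies on this fact; it assumes real data and nonzero pivots, and the name `lr_decomposition` is hypothetical.

```python
def lr_decomposition(a):
    """Return (L, R) with A = L R, L unit lower triangular, R upper triangular.

    Assumption of this sketch: no zero pivot occurs, i.e., Gaussian
    elimination can be carried out without reordering.
    """
    n = len(a)
    r = [row[:] for row in a]
    l = [[1.0 if j == k else 0.0 for k in range(n)] for j in range(n)]
    for m in range(n - 1):
        if r[m][m] == 0.0:
            raise ZeroDivisionError("zero pivot: a reordering of rows would be needed")
        for j in range(m + 1, n):
            l[j][m] = r[j][m] / r[m][m]  # elimination quotient, stored in L
            for k in range(m, n):
                r[j][k] -= l[j][m] * r[m][k]
    return l, r


a = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
l, r = lr_decomposition(a)
print(l)
print(r)
```

Once $L$ and $R$ are available, a system $Ax = y$ is solved by one forward substitution with $L$ followed by one back substitution with $R$, which is convenient when several right-hand sides arrive one after another.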

We wish to point out that not every nonsingular matrix allows an LR
decomposition. For example,

\[
A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}
\]

has no LR decomposition. However, since Gaussian elimination with row
reordering always works, for each nonsingular matrix $A$ there exists a permutation
matrix $P$ such that $PA$ has an LR decomposition (see Problem
2.16). A permutation matrix is a matrix of the form $P = (e_{p(1)}, \ldots, e_{p(n)})$
where $e_1, \ldots, e_n$ are the columns of the identity matrix and $p(1), \ldots, p(n)$
is a permutation of $1, \ldots, n$.

Recall that an $n \times n$ matrix $A$ is called symmetric if it has real coefficients
and $A = A^T$. A symmetric matrix $A$ is called positive definite if $x^T A x > 0$
for all $x \in \mathbb{R}^n$ with $x \neq 0$. Positive definite matrices have positive diagonal
elements (see Problem 2.10), and therefore a reordering of rows and columns
is not necessary for Gaussian elimination (for pivoting, the largest diagonal
element is chosen). It can be shown (see Problem 2.13) that symmetry and
positive definiteness are preserved throughout the elimination if diagonal
elements are taken as pivot elements. Therefore, for symmetric positive
definite matrices the LR decomposition is always possible. If $A = LR$, then
we have also $A = A^T = R^T L^T$, and from Problem 2.15 we can deduce that
$L$ can be normalized such that $A = LL^T$. Such a decomposition is used
in the Cholesky method for the solution of linear systems with symmetric
positive definite matrices. Because of symmetry, the computational cost
for the Cholesky method is $n^3/6 + O(n^2)$ multiplications and $n^3/6 + O(n^2)$
additions. For details we refer to [26, 27].
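A factor $L$ with $A = LL^T$ can be computed column by column by comparing coefficients on both sides. The following Python sketch shows this common formulation of the Cholesky method; it is not taken from the text, assumes a real symmetric positive definite matrix given as nested lists, and uses the illustrative name `cholesky`.

```python
import math


def cholesky(a):
    """Return the lower triangular L with A = L L^T."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: l_jj^2 = a_jj - sum_{k<j} l_jk^2
        s = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite")
        l[j][j] = math.sqrt(s)
        # Entries below the diagonal in column j
        for i in range(j + 1, n):
            l[i][j] = (a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))) / l[j][j]
    return l


l_factor = cholesky([[4.0, 2.0], [2.0, 5.0]])
print(l_factor)
```

Only the lower triangle of $A$ is read, which reflects the saving due to symmetry mentioned above.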



2.4 QR Decomposition 

We conclude this chapter by describing a second elimination method for
linear systems, which leads to a QR decomposition.

Definition 2.10 A factorization of a matrix $A$ into a product

\[
A = QR
\]

of a unitary matrix $Q$ and an upper (right) triangular matrix $R$ is called a
QR decomposition of $A$.

We recall that a matrix $Q$ is called unitary if

\[
QQ^* = Q^*Q = I.
\]

The product of two unitary matrices again is unitary.

In terms of the columns of the matrices $A = (a_1, \ldots, a_n)$ and
$Q = (q_1, \ldots, q_n)$ and the coefficients of $R = (r_{jk})$, the QR decomposition
$A = QR$ means that

\[
a_k = \sum_{i=1}^{k} r_{ik} q_i, \quad k = 1, \ldots, n. \qquad (2.7)
\]

Hence, the vectors $a_1, \ldots, a_n$ of $\mathbb{C}^n$ have to be orthonormalized from the
left to the right into an orthonormal basis $q_1, \ldots, q_n$. This, for example, can
be achieved by the Gram-Schmidt orthonormalization procedure (see Theorem
3.18). However, since the Gram-Schmidt orthonormalization tends to
be numerically unstable, we describe the QR decomposition by Householder
matrices.
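To make the relation (2.7) concrete, the following Python sketch orthonormalizes the columns from left to right by the classical Gram-Schmidt procedure and records the coefficients $r_{ik}$ along the way. It is meant only as an illustration for real, linearly independent columns; in view of the instability just mentioned it is not a recommended algorithm, and the name `gram_schmidt_qr` is hypothetical.

```python
import math


def gram_schmidt_qr(columns):
    """Classical Gram-Schmidt for real, linearly independent columns a_1..a_n.

    Returns the orthonormal vectors q_1..q_n and the upper triangular
    coefficients r[i][k] with a_k = sum_{i <= k} r[i][k] * q_i.
    """
    n = len(columns)
    q = []
    r = [[0.0] * n for _ in range(n)]
    for k, a_k in enumerate(columns):
        w = a_k[:]
        for i, q_i in enumerate(q):
            r[i][k] = sum(qc * ac for qc, ac in zip(q_i, a_k))  # scalar product of a_k and q_i
            w = [wc - r[i][k] * qc for wc, qc in zip(w, q_i)]
        r[k][k] = math.sqrt(sum(wc * wc for wc in w))
        q.append([wc / r[k][k] for wc in w])
    return q, r


cols = [[3.0, 4.0], [1.0, 0.0]]
q, r = gram_schmidt_qr(cols)
print(q)
print(r)
```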

Definition 2.11 A matrix $H$ of the form

\[
H = I - 2vv^*,
\]

where $v$ is a column vector with $v^*v = 1$, i.e., a unit vector, is called a
Householder matrix.

Remark 2.12 Householder matrices are unitary and satisfy $H = H^*$.

Proof. We compute

\[
H^* = I^* - 2(vv^*)^* = I - 2vv^* = H
\]

and

\[
HH^* = H^*H = (I - 2vv^*)(I - 2vv^*) = I - 4vv^* + 4vv^*vv^* = I,
\]

where we use that $v^*v = 1$. □

Geometrically a Householder matrix corresponds to reflection across the
plane through the origin orthogonal to $v$. To see this we write

\[
x = vv^*x + y
\]

with the component $vv^*x$ of $x \in \mathbb{C}^n$ in the $v$-direction and a component $y$
orthogonal to $v$. Then we obtain

\[
Hx = x - 2vv^*x = -vv^*x + y;
\]

i.e., $Hx$ has the opposite component $-vv^*x$ in the $v$-direction and the
same component $y$ orthogonal to $v$. Because of this property, Householder
matrices are also called elementary reflection matrices.
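The reflection property is easy to observe numerically. The short Python sketch below applies $H = I - 2vv^*$ to a vector without ever forming the matrix; it is restricted to real vectors for simplicity, and the function name `householder_apply` is an illustrative choice.

```python
import math


def householder_apply(v, x):
    """Return H x = x - 2 v (v^T x) for a real unit vector v."""
    vx = sum(vi * xi for vi, xi in zip(v, x))
    return [xi - 2.0 * vx * vi for vi, xi in zip(v, x)]


v = [0.6, 0.8]                      # unit vector: 0.36 + 0.64 = 1
x = [1.0, 2.0]
hx = householder_apply(v, x)        # reflected vector
hhx = householder_apply(v, hx)      # reflecting twice returns x

norm_x = math.sqrt(sum(c * c for c in x))
norm_hx = math.sqrt(sum(c * c for c in hx))
print(hx, hhx, norm_x, norm_hx)
```

Applying $H$ in this way costs a number of operations proportional to $n$ per vector, instead of $n^2$ for a multiplication with the full matrix.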

We now describe the elimination of the unknown $x_1$ by multiplying $A$
from the left by a Householder matrix $H_1 = I - 2v_1 v_1^*$. By $a_1$ we denote
the first column of $A$ and by $e_k$ the $k$th column of the identity matrix; in
particular, $e_1 = (1, 0, \ldots, 0)^*$. Then the first column $b_1$ of the product $H_1 A$
is given by

\[
b_1 = H_1 A e_1 = H_1 a_1 = a_1 - 2 v_1 v_1^* a_1.
\]

We would like to achieve that $b_1 = \sigma e_1$ with $\sigma \neq 0$. Hence, except for the
first row, $v_1$ must be a multiple of $a_1$. Therefore, we try

\[
u_1 = a_1 \mp \sigma e_1 \qquad (2.8)
\]

with

\[
\sigma = \begin{cases}
\dfrac{a_{11}}{|a_{11}|} \sqrt{a_1^* a_1}, & a_{11} \neq 0, \\[1ex]
\sqrt{a_1^* a_1}, & a_{11} = 0.
\end{cases}
\]

Then we have

\[
u_1^* u_1 = 2 \left( a_1^* a_1 \mp |a_{11}| \sqrt{a_1^* a_1} \right)
\]

and

\[
u_1^* a_1 = a_1^* a_1 \mp |a_{11}| \sqrt{a_1^* a_1} = \frac{1}{2}\, u_1^* u_1.
\]

Without loss of generality we may assume that $\sqrt{a_1^* a_1} - |a_{11}| > 0$, since
otherwise we would have that $a_1 = a_{11} e_1$, i.e., that the first column already
has the required form. Therefore, if we finally choose

\[
v_1 = \frac{u_1}{\sqrt{u_1^* u_1}},
\]

then $v_1$ is a unit vector, and as requested we have

\[
b_1 = a_1 - \frac{2 u_1 u_1^* a_1}{u_1^* u_1} = a_1 - u_1 = \pm \sigma e_1.
\]

The remaining columns $b_k = H_1 A e_k$ are obtained from the columns $a_k$ of
$A$ by

\[
b_k = H_1 A e_k = H_1 a_k = a_k - 2 v_1 v_1^* a_k = a_k - \frac{u_1^* a_k}{u_1^* a_1}\, u_1, \quad k = 2, \ldots, n.
\]

From the two possible signs in (2.8) the positive sign yields the numerically
more stable variant.

The same procedure is now repeated for the remaining $(n-1) \times (n-1)$
matrix. The corresponding $(n-1) \times (n-1)$ Householder matrix has to be
completed as an $n \times n$ Householder matrix. In general, if $A_k$ is an $n \times n$
matrix of the form

\[
A_k = \begin{pmatrix} R_k & * \\ 0 & A_{n-k} \end{pmatrix}
\]

with a $k \times k$ upper triangular matrix $R_k$ and an $(n-k) \times (n-k)$ matrix
$A_{n-k}$, we apply the Householder transformation described above with the
first column of $A_{n-k}$. With the corresponding $(n-k) \times (n-k)$ Householder
matrix $H_{n-k}$ the $n \times n$ matrix

\[
\begin{pmatrix} I & 0 \\ 0 & H_{n-k} \end{pmatrix}
\]

(with the $k \times k$ identity matrix $I$ in the upper left block)
yields an $n \times n$ Householder matrix $H_k$ that leaves the first $k$ columns
in triangular form and, in addition, transforms the $(k+1)$st column into
triangular form. In this way, after at most $n - 1$ steps, we arrive at

\[
H_{n-1} \cdots H_1 A = R
\]

with Householder matrices $H_1, \ldots, H_{n-1}$ and an upper triangular matrix
$R$. From this we obtain

\[
A = QR
\]

with the unitary matrix

\[
Q = (H_{n-1} \cdots H_1)^* = H_1 \cdots H_{n-1}.
\]

We summarize our result in the following theorem.

Theorem 2.13 To each $n \times n$ matrix a QR decomposition can be obtained
through $n - 1$ Householder transformations.
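The construction can be followed step by step in a short program. The Python sketch below is restricted to real square matrices and uses the positive sign in (2.8), i.e., $u = a + \sigma e_1$ with $\sigma$ carrying the sign of the leading entry; the reflections are applied through $u$ directly, without forming the matrices $H_k$. The name `householder_qr` and the test matrix are illustrative assumptions.

```python
import math


def householder_qr(a):
    """Return (Q, R) with A = Q R for a real square matrix A (nested lists)."""
    n = len(a)
    r = [row[:] for row in a]
    q = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n - 1):
        col = [r[i][k] for i in range(k, n)]       # first column of the remaining block
        norm = math.sqrt(sum(c * c for c in col))
        if norm == 0.0:
            continue                               # nothing to eliminate in this column
        sigma = math.copysign(norm, col[0])
        u = col[:]
        u[0] += sigma                              # positive sign in (2.8)
        uu = sum(c * c for c in u)
        # R := H R, where H = I - 2 u u^T / (u^T u) acts on rows k, ..., n-1
        for j in range(n):
            f = 2.0 * sum(u[i] * r[k + i][j] for i in range(n - k)) / uu
            for i in range(n - k):
                r[k + i][j] -= f * u[i]
        # Q := Q H, so that finally Q = H_1 H_2 ... H_{n-1}
        for i in range(n):
            f = 2.0 * sum(q[i][k + t] * u[t] for t in range(n - k)) / uu
            for t in range(n - k):
                q[i][k + t] -= f * u[t]
    return q, r


a = [[1.0, 6.0, -2.0], [2.0, 1.0, -2.0], [2.0, 2.0, 6.0]]
q, r = householder_qr(a)
print(r[0])
```

In floating-point arithmetic the entries of $R$ below the diagonal come out as zero only up to roundoff; an implementation may choose to set them to zero explicitly.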

The elimination by QR decomposition via Householder matrices can be
considered as an alternative to Gaussian elimination, since it does not need
pivoting. However, the operation count shows that $2n^3/3 + O(n^2)$ multiplications
are required (see Problem 2.18), i.e., twice the cost of Gaussian
elimination, and the added expense of partial pivoting in Gaussian elimination
does not close this gap. Hence, QR decomposition is rarely used
for the solution of linear systems. But later in this book we will see that
QR decomposition is an essential part of one of the best algorithms for
numerically computing the eigenvalues of a matrix (see Section 7.4).

Problems

2.1 Solve the linear system

\[
\begin{aligned}
2x_1 + 4x_2 + x_3 &= 4 \\
2x_1 + 6x_2 - x_3 &= 10 \\
x_1 + 5x_2 + 2x_3 &= 2
\end{aligned}
\]

by Gaussian elimination.

2.2 Write a computer program for the solution of a system of linear equations 
by Gaussian elimination with partial pivoting and test it for various examples. 
You will need this code as part of other numerical algorithms later in this book. 

2.3 Describe pivoting in Gaussian elimination by using permutation matrices. 

2.4 Let A and B be two n x n matrices. Show that if AB is nonsingular, then 
A and B are nonsingular. 

2.5 Let $A$, $B$, $C$, and $D$ be $n \times n$ matrices and let $A$ be nonsingular. Show that

\[
\det \begin{pmatrix} A & B \\ C & D \end{pmatrix} = \det A \, \det (D - CA^{-1}B).
\]

2.6 Verify the summation formulas

\[
\sum_{k=1}^{n} k = \frac{1}{2}\, n(n+1) \quad \text{and} \quad \sum_{k=1}^{n} k^2 = \frac{1}{6}\, n(n+1)(2n+1)
\]

that were used in the proof of Theorem 2.7.

2.7 Prove the analogue of Theorem 2.7 for the number of additions in Gaussian 
elimination. 

2.8 Show that tridiagonal matrices

\[
\begin{pmatrix}
a_1 & c_1 & & & & \\
b_2 & a_2 & c_2 & & & \\
& b_3 & a_3 & c_3 & & \\
& & \ddots & \ddots & \ddots & \\
& & & b_{n-1} & a_{n-1} & c_{n-1} \\
& & & & b_n & a_n
\end{pmatrix}
\]

with the properties

\[
|a_j| \geq |b_j| + |c_j|, \quad b_j c_j \neq 0, \quad j = 2, \ldots, n-1,
\]

and $|a_1| > |c_1| > 0$ and $|a_n| > |b_n| > 0$ are nonsingular.

2.9 Show that Gaussian elimination for tridiagonal $n \times n$ matrices requires $4n$
multiplications.

2.10 Show that the diagonal elements of a positive definite matrix are positive.

2.11 Prove that if $A = LL^T$ where $L$ is a real lower triangular nonsingular $n \times n$
matrix, then $A$ is symmetric and positive definite.

2.12 Show that

\[
\begin{pmatrix} 1 & 2 & 3 \\ 2 & 3 & 4 \\ 3 & 4 & 4 \end{pmatrix}
\]

is not positive definite.

2.13 Show that for a symmetric positive definite matrix the symmetry and positive
definiteness are preserved in Gaussian elimination if diagonal elements are
taken as pivot elements, i.e., the submatrices $(a_{jk}^{(m)})$, $j, k = m, \ldots, n$, are symmetric
and positive definite.

2.14 Show that the product of two lower (upper) triangular matrices again is
lower (upper) triangular, that lower (upper) triangular matrices with nonvanishing
diagonal elements are nonsingular, and that the inverse matrix of a lower
(upper) triangular matrix again is lower (upper) triangular.

2.15 Let $A$ be a nonsingular matrix and suppose $A = L_1 R_1 = L_2 R_2$, where $L_1$
and $L_2$ are lower triangular matrices with diagonal elements equal to one and $R_1$
and $R_2$ are upper triangular matrices. Show that $L_1 = L_2$ and $R_1 = R_2$.

2.16 Show that for each nonsingular $n \times n$ matrix $A$ there exists a permutation
matrix $P$ such that $PA$ has an LR decomposition.

2.17 Solve the linear system

\[
\begin{aligned}
x_1 + 6x_2 - 2x_3 &= 5 \\
2x_1 + x_2 - 2x_3 &= 1 \\
2x_1 + 2x_2 + 6x_3 &= 10
\end{aligned}
\]

by QR decomposition.

2.18 Show that the solution of an $n \times n$ linear system by QR elimination with
Householder matrices requires $2n^3/3 + O(n^2)$ multiplications.

2.19 Let $A$ be a complex $n \times n$ matrix and $y \in \mathbb{C}^n$ and assume that $A$, $\operatorname{Re} A$,
and $\operatorname{Im} A$ are nonsingular. Show that the $n \times n$ complex linear system $Ax = y$ is
equivalent to the two $n \times n$ real systems

\[
\begin{aligned}
\{(\operatorname{Im} A)^{-1} \operatorname{Re} A + (\operatorname{Re} A)^{-1} \operatorname{Im} A\} \operatorname{Re} x &= (\operatorname{Im} A)^{-1} \operatorname{Re} y + (\operatorname{Re} A)^{-1} \operatorname{Im} y, \\
\{(\operatorname{Im} A)^{-1} \operatorname{Re} A + (\operatorname{Re} A)^{-1} \operatorname{Im} A\} \operatorname{Im} x &= (\operatorname{Im} A)^{-1} \operatorname{Im} y - (\operatorname{Re} A)^{-1} \operatorname{Re} y.
\end{aligned}
\]

2.20 Use QR decomposition to prove Hadamard's inequality

\[
|\det A|^2 \leq \prod_{j=1}^{n} \sum_{k=1}^{n} |a_{jk}|^2
\]

for the determinant of an $n \times n$ matrix $A = (a_{jk})$.

In the subsequent chapters we want to discuss iterative methods for the
solution of systems of linear and nonlinear equations. For this we will need
some fundamental concepts of functional analysis, which we will start to
develop now. We shall use these functional-analytic tools also in later parts
of this book in some of our convergence and error analysis for the approximate
solution of differential and integral equations.

We begin by introducing the notions of normed spaces and their ele- 
mentary properties, where we assume that the reader is familiar with the 
concept of linear spaces or vector spaces and their basic properties. Then 
we proceed by considering scalar product spaces as special cases of normed 
spaces. 

We will continue with the discussion of linear and continuous operators 
acting between normed spaces. Particular attention is given to linear oper- 
ators between finite-dimensional spaces, i.e., to matrices and their various 
norms. The main part of this chapter is Banach’s fixed point theorem, also 
known as the contraction mapping principle, which is one of the most im- 
portant tools in numerical analysis and is the fundamental basis of our 
investigations of iterative methods for linear and nonlinear systems. At the 
end of the chapter we will introduce some of the basic concepts of approx- 
imation theory, which will be useful later in other parts of this book. 

For a broader and more detailed study we refer to [5, 34, 35, 39, 59] or
any other introductory book on functional analysis.

3.1 Normed Spaces

Definition 3.1 Let $X$ be a complex (or real) linear space (vector space).
A function $\|\cdot\| : X \to \mathbb{R}$ with the properties

(N1) $\|x\| \geq 0$, (positivity)

(N2) $\|x\| = 0$ if and only if $x = 0$, (definiteness)

(N3) $\|\alpha x\| = |\alpha| \, \|x\|$, (homogeneity)

(N4) $\|x + y\| \leq \|x\| + \|y\|$, (triangle inequality)

for all $x, y \in X$ and all $\alpha \in \mathbb{C}$ (or $\mathbb{R}$) is called a norm on $X$. A linear
space $X$ equipped with a norm is called a normed space. For $X = \mathbb{R}^n$ or
$X = \mathbb{C}^n$ we will also call the norm a vector norm.



Example 3.2 Some examples of norms on $\mathbb{R}^n$ and $\mathbb{C}^n$ are given by

\[
\|x\|_1 := \sum_{j=1}^{n} |x_j|, \qquad
\|x\|_2 := \left( \sum_{j=1}^{n} |x_j|^2 \right)^{1/2}, \qquad
\|x\|_\infty := \max_{j=1,\ldots,n} |x_j|
\]

for $x = (x_1, \ldots, x_n)^T$. It is an easy exercise for the reader to verify that
the norm axioms (N1)-(N3) are satisfied. The triangle inequality for the
norms $\|\cdot\|_1$ and $\|\cdot\|_\infty$ follows immediately from the triangle inequality in
$\mathbb{R}$ or $\mathbb{C}$. The verification of the triangle inequality for the norm $\|\cdot\|_2$ is
postponed until Section 3.2. □

The norms in Example 3.2 are denoted the $\ell_1$, $\ell_2$, and $\ell_\infty$ norm, respectively.
For obvious reasons the $\ell_2$ norm is also called the Euclidean norm,
and the $\ell_\infty$ norm is called the maximum norm. The three norms are special
cases of the $\ell_p$ norm

\[
\|x\|_p := \left( \sum_{j=1}^{n} |x_j|^p \right)^{1/p}, \qquad (3.1)
\]

defined for any real number $p \geq 1$. The $\ell_\infty$ norm is the limiting case of
(3.1) as $p \to \infty$ (see Problem 3.1).
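A few lines of Python make the relation between these norms tangible, in particular the approach of the $\ell_p$ norm to the maximum norm for growing $p$. The vector and the function names below are illustrative choices.

```python
def lp_norm(x, p):
    """The l_p norm (3.1) for a real number p >= 1."""
    return sum(abs(c) ** p for c in x) ** (1.0 / p)


def max_norm(x):
    """The maximum norm, i.e., the l_infinity norm."""
    return max(abs(c) for c in x)


x = [3.0, -4.0, 1.0]
norm_1 = lp_norm(x, 1)      # 3 + 4 + 1
norm_2 = lp_norm(x, 2)      # square root of 9 + 16 + 1
norm_50 = lp_norm(x, 50)    # already very close to the maximum norm
norm_max = max_norm(x)
print(norm_1, norm_2, norm_50, norm_max)
```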

Remark 3.3 For each norm, the second triangle inequality

\[
\bigl| \, \|x\| - \|y\| \, \bigr| \leq \|x - y\|
\]

holds for all $x, y \in X$.

Proof. From the triangle inequality we have

\[
\|x\| = \|x - y + y\| \leq \|x - y\| + \|y\|,
\]

whence $\|x\| - \|y\| \leq \|x - y\|$ follows. Analogously, by interchanging the
roles of $x$ and $y$ we have $\|y\| - \|x\| \leq \|y - x\|$. □

For two elements $x, y$ in a normed space $\|x - y\|$ is called the distance
between $x$ and $y$.

Definition 3.4 A sequence $(x_n)$ of elements in a normed space $X$ is called
convergent if there exists an element $x \in X$ such that

\[
\lim_{n \to \infty} \|x_n - x\| = 0,
\]

i.e., if for every $\varepsilon > 0$ there exists an integer $N(\varepsilon)$ such that $\|x_n - x\| < \varepsilon$
for all $n \geq N(\varepsilon)$. The element $x$ is called the limit of the sequence $(x_n)$,
and we write

\[
\lim_{n \to \infty} x_n = x
\]

or

\[
x_n \to x, \quad n \to \infty.
\]

A sequence that does not converge is called divergent.

Theorem 3.5 The limit of a convergent sequence is uniquely determined. 

Proof. Assume that $x_n \to x$ and $x_n \to y$ for $n \to \infty$. Then from the triangle
inequality we obtain that

\[
\|x - y\| = \|x - x_n + x_n - y\| \leq \|x - x_n\| + \|x_n - y\| \to 0, \quad n \to \infty.
\]

Therefore, $\|x - y\| = 0$ and $x = y$ by (N2). □

Definition 3.6 Two norms on a linear space are called equivalent if they 
have the same convergent sequences. 

Theorem 3.7 Two norms $\|\cdot\|_a$ and $\|\cdot\|_b$ on a linear space $X$ are equivalent
if and only if there exist positive numbers $c$ and $C$ such that

\[
c\|x\|_a \leq \|x\|_b \leq C\|x\|_a
\]

for all $x \in X$. The limits with respect to the two norms coincide.

Proof. Provided that the conditions are satisfied, from $\|x_n - x\|_a \to 0$,
$n \to \infty$, it follows that $\|x_n - x\|_b \to 0$, $n \to \infty$, and vice versa.

Conversely, let the two norms be equivalent and assume that there is
no $C > 0$ such that $\|x\|_b \leq C\|x\|_a$ for all $x \in X$. Then there exists a
sequence $(x_n)$ with $\|x_n\|_a = 1$ and $\|x_n\|_b \geq n^2$. Now, the sequence $(y_n)$
with $y_n := x_n/n$ converges to zero with respect to $\|\cdot\|_a$, whereas with
respect to $\|\cdot\|_b$ it is divergent because of $\|y_n\|_b \geq n$. □

Theorem 3.8 On a finite-dimensional linear space all norms are equivalent.

Proof. In a linear space $X$ with finite dimension $n$ and basis $u_1, \ldots, u_n$
every element can be expressed in the form

\[
x = \sum_{j=1}^{n} \alpha_j u_j.
\]

As in Example 3.2,

\[
\|x\|_\infty := \max_{j=1,\ldots,n} |\alpha_j| \qquad (3.2)
\]

defines a norm on $X$. Let $\|\cdot\|$ denote any other norm on $X$. Then, by the
triangle inequality we have

\[
\|x\| \leq \sum_{j=1}^{n} |\alpha_j| \, \|u_j\| \leq C \|x\|_\infty
\]

for all $x \in X$, where

\[
C := \sum_{j=1}^{n} \|u_j\|.
\]

Assume that there is no $c > 0$ such that $c\|x\|_\infty \leq \|x\|$ for all $x \in X$.
Then there exists a sequence $(x_\nu)$ with $\|x_\nu\| = 1$ such that $\|x_\nu\|_\infty \geq \nu$.
Consider the sequence $(y_\nu)$ with $y_\nu := x_\nu / \|x_\nu\|_\infty$ and write

\[
y_\nu = \sum_{j=1}^{n} \alpha_{j\nu} u_j.
\]

Because of $\|y_\nu\|_\infty = 1$ each of the sequences $(\alpha_{j\nu})$, $j = 1, \ldots, n$, is bounded
in $\mathbb{C}$. Hence, by the Bolzano-Weierstrass theorem we can select convergent
subsequences $\alpha_{j,\nu(\ell)} \to \alpha_j$, $\ell \to \infty$, for each $j = 1, \ldots, n$. This now implies
$\|y_{\nu(\ell)} - y\|_\infty \to 0$, $\ell \to \infty$, where

\[
y := \sum_{j=1}^{n} \alpha_j u_j,
\]

and also $\|y_{\nu(\ell)} - y\| \leq C \|y_{\nu(\ell)} - y\|_\infty \to 0$, $\ell \to \infty$. But on the other hand
we have $\|y_\nu\| = 1/\|x_\nu\|_\infty \to 0$, $\nu \to \infty$. Therefore, $y = 0$, and consequently
$\|y_{\nu(\ell)}\|_\infty \to 0$, $\ell \to \infty$, which contradicts $\|y_\nu\|_\infty = 1$ for all $\nu$. □
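For the three norms of Example 3.2 the constants whose existence Theorem 3.8 guarantees can be written down explicitly. The following chain of inequalities is a standard fact, added here as a worked illustration rather than taken from the text; it holds for all $x \in \mathbb{C}^n$.

```latex
\[
\|x\|_\infty \;\leq\; \|x\|_2 \;\leq\; \|x\|_1 \;\leq\; n \, \|x\|_\infty .
\]
% First inequality:  \max_j |x_j|^2 \leq \sum_j |x_j|^2.
% Second inequality: \sum_j |x_j|^2 \leq \bigl( \sum_j |x_j| \bigr)^2,
%                    since all mixed products |x_j| |x_k| are nonnegative.
% Third inequality:  each of the n summands |x_j| is at most \max_j |x_j|.
```

In the sense of Theorem 3.7 this gives, for example, $c = 1$ and $C = n$ for the pair $\|\cdot\|_\infty$ and $\|\cdot\|_1$. The constants depend on the dimension $n$, which is one reason why the statement does not carry over to infinite-dimensional spaces.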



The following definitions carry over some useful concepts from Euclidean 
space to general normed spaces. 

Definition 3.9 A subset $U$ of a normed space $X$ is called closed if it contains
all limits of convergent sequences of $U$. The closure $\overline{U}$ of a subset $U$
of a normed space $X$ is the set of all limits of convergent sequences of $U$. A
subset $U$ is called open if its complement $X \setminus U$ is closed. A set $U$ is called
dense in another set $V$ if $V \subset \overline{U}$, i.e., if each element in $V$ is the limit of
a convergent sequence from $U$.

Obviously, a subset $U$ is closed if and only if it coincides with its closure.
For $x_0$ in $X$ and $r > 0$ the set $B[x_0, r] := \{x \in X : \|x - x_0\| \leq r\}$ is closed
and is called the closed ball of radius $r$ and center $x_0$. Correspondingly, the
set $B(x_0, r) := \{x \in X : \|x - x_0\| < r\}$ is open and is called an open ball.

Definition 3.10 A subset $U$ of a normed space $X$ is called bounded if
there exists a positive number $C$ such that $\|x\| \leq C$ for all $x \in U$.

Convergent sequences are bounded (see Problem 3.6). 

Theorem 3.11 Any bounded sequence in a finite-dimensional normed space
$X$ contains a convergent subsequence.

Proof. Let $u_1, \ldots, u_n$ be a basis of $X$ and let $(x_\nu)$ be a bounded sequence.
Then writing

\[
x_\nu = \sum_{j=1}^{n} \alpha_{j\nu} u_j
\]

and using the norm (3.2), as in the proof of Theorem 3.8 we deduce that
each of the sequences $(\alpha_{j\nu})$, $j = 1, \ldots, n$, is bounded in $\mathbb{C}$. Hence, by
the Bolzano-Weierstrass theorem we can select convergent subsequences
$\alpha_{j,\nu(\ell)} \to \alpha_j$, $\ell \to \infty$, for each $j = 1, \ldots, n$. This now implies

\[
x_{\nu(\ell)} \to \sum_{j=1}^{n} \alpha_j u_j \in X, \quad \ell \to \infty,
\]

and the proof is finished. □



3.2 Scalar Products 

Definition 3.12 Let $X$ be a complex (or real) linear space. Then a function
$(\cdot\,, \cdot) : X \times X \to \mathbb{C}$ (or $\mathbb{R}$) with the properties

(H1) $(x, x) \geq 0$, (positivity)

(H2) $(x, x) = 0$ if and only if $x = 0$, (definiteness)

(H3) $(x, y) = \overline{(y, x)}$, (symmetry)

(H4) $(\alpha x + \beta y, z) = \alpha (x, z) + \beta (y, z)$, (linearity)

for all $x, y, z \in X$ and $\alpha, \beta \in \mathbb{C}$ (or $\mathbb{R}$) is called a scalar product, or an
inner product, on $X$. (By the bar we denote the complex conjugate.) A
linear space $X$ equipped with a scalar product is called a pre-Hilbert space.

As a simple consequence of (H3) and (H4) we note the antilinearity

(H4') $(x, \alpha y + \beta z) = \bar{\alpha} (x, y) + \bar{\beta} (x, z)$.

Example 3.13 An example of a scalar product on $\mathbb{R}^n$ and $\mathbb{C}^n$ is given by

\[
(x, y) := \sum_{j=1}^{n} x_j \bar{y}_j
\]

for $x = (x_1, \ldots, x_n)^T$ and $y = (y_1, \ldots, y_n)^T$. (Note that $(x, y) = y^* x$.)

Theorem 3.14 For a scalar product we have the Cauchy-Schwarz inequality

\[
|(x, y)|^2 \leq (x, x)\,(y, y)
\]

for all $x, y \in X$, with equality if and only if $x$ and $y$ are linearly dependent.

Proof. The inequality is trivial for $x = 0$. For $x \neq 0$ it follows from

\[
\begin{aligned}
(\alpha x + \beta y, \alpha x + \beta y) &= |\alpha|^2 (x, x) + 2 \operatorname{Re}\{\alpha \bar{\beta} (x, y)\} + |\beta|^2 (y, y) \\
&= (x, x)(y, y) - |(x, y)|^2,
\end{aligned}
\]

where we have set $\alpha = -(x, x)^{-1/2}\, \overline{(x, y)}$ and $\beta = (x, x)^{1/2}$. Since $(\cdot\,, \cdot)$ is
positive definite, this expression is nonnegative, and it is equal to zero if
and only if $\alpha x + \beta y = 0$. In the latter case $x$ and $y$ are linearly dependent
because $\beta \neq 0$. □

Theorem 3.15 A scalar product $(\cdot\,, \cdot)$ on a linear space $X$ defines a norm
by

\[
\|x\| := (x, x)^{1/2}
\]

for all $x \in X$; i.e., a pre-Hilbert space is always a normed space.

Proof. We leave it as an exercise for the reader to verify the norm axioms.
The triangle inequality follows by

\[
\|x + y\|^2 = (x + y, x + y) \leq \|x\|^2 + 2\|x\|\,\|y\| + \|y\|^2 = (\|x\| + \|y\|)^2
\]

from the Cauchy-Schwarz inequality. □

Note that we can rewrite the Cauchy-Schwarz inequality in the form

\[
|(x, y)| \leq \|x\| \, \|y\|.
\]

The scalar product of Example 3.13 generates the Euclidean norm of Example
3.2, and therefore it is called the Euclidean scalar product. Theorem
3.15 includes the triangle inequality for the Euclidean norm that we postponed
in Example 3.2.
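The Euclidean scalar product and the Cauchy-Schwarz inequality can be checked on a small complex example. The Python sketch below follows Example 3.13, with the conjugate on the second argument; the vectors and the function names are illustrative choices.

```python
def scalar_product(x, y):
    """Euclidean scalar product (x, y) = sum_j x_j * conj(y_j)."""
    return sum(complex(xj) * complex(yj).conjugate() for xj, yj in zip(x, y))


def euclidean_norm(x):
    """Norm generated by the scalar product; (x, x) is real and nonnegative."""
    return scalar_product(x, x).real ** 0.5


x = [1 + 2j, 3j]
y = [2 - 1j, 1 + 1j]

xy = scalar_product(x, y)
lhs = abs(xy)                                  # |(x, y)|
rhs = euclidean_norm(x) * euclidean_norm(y)    # ||x|| ||y||
print(xy, lhs, rhs)
```

For these vectors $(x, y) = 3 + 8i$, so that $|(x, y)|^2 = 73$, while $(x, x)(y, y) = 14 \cdot 7 = 98$, in accordance with the inequality.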

The following definition generalizes the concept of orthogonality from
Euclidean space to pre-Hilbert spaces.

Definition 3.16 Two elements $x$ and $y$ of a pre-Hilbert space $X$ are called
orthogonal if

\[
(x, y) = 0.
\]

Two subsets $U$ and $V$ of $X$ are called orthogonal if each pair of elements
$x \in U$ and $y \in V$ are orthogonal. For two orthogonal elements or subsets
we write $x \perp y$ and $U \perp V$, respectively. A subset $U$ of $X$ is called an
orthogonal system if $(x, y) = 0$ for all $x, y \in U$ with $x \neq y$. An orthogonal
system $U$ is called an orthonormal system if $\|x\| = 1$ for all $x \in U$.

Theorem 3.17 The elements of an orthonormal system are linearly inde- 
pendent. 

Proof. From

$$\sum_{k=1}^{n} \alpha_k q_k = 0$$

for the orthonormal system $\{q_1,\dots,q_n\}$, by taking the scalar product with
$q_j$, we immediately have that $\alpha_j = 0$ for $j = 1,\dots,n$. □



The Gram-Schmidt orthogonalization procedure as described in the following
theorem provides a converse of Theorem 3.17. For a subset $U$ of
a linear space $X$ we denote the set spanned by all linear combinations of
elements of $U$ by $\mathrm{span}\{U\}$.

Theorem 3.18 Let $\{u_0, u_1, \dots\}$ be a finite or countable number of linearly
independent elements of a pre-Hilbert space. Then there exists a uniquely
determined orthogonal system $\{q_0, q_1, \dots\}$ of the form

$$q_n = u_n + r_n, \quad n = 0, 1, \dots, \qquad (3.3)$$

with $r_0 = 0$ and $r_n \in \mathrm{span}\{u_0,\dots,u_{n-1}\}$, $n = 1, 2, \dots$, satisfying

$$\mathrm{span}\{u_0,\dots,u_n\} = \mathrm{span}\{q_0,\dots,q_n\}, \quad n = 0, 1, \dots. \qquad (3.4)$$

Proof. Assume that we have constructed orthogonal elements of the form
(3.3) with the property (3.4) up to $q_{n-1}$. By (3.4), the $q_0,\dots,q_{n-1}$ are
linearly independent, and therefore $\|q_k\| \ne 0$ for $k = 0, 1, \dots, n-1$. Hence,

$$q_n := u_n - \sum_{k=0}^{n-1} \frac{(u_n, q_k)}{(q_k, q_k)}\, q_k$$

is well-defined, and using the induction assumption, we obtain $(q_n, q_m) = 0$
for $m = 0,\dots,n-1$ and

$$\mathrm{span}\{u_0,\dots,u_{n-1},u_n\} = \mathrm{span}\{q_0,\dots,q_{n-1},u_n\} = \mathrm{span}\{q_0,\dots,q_{n-1},q_n\}.$$

Hence, the existence of $q_n$ is established.

Assume that $\{q_0, q_1, \dots\}$ and $\{\tilde q_0, \tilde q_1, \dots\}$ are two orthogonal sets of
elements with the required properties. Then clearly $q_0 = u_0 = \tilde q_0$. Assume
that we have shown that equality holds up to $q_{n-1} = \tilde q_{n-1}$. Then, since
$q_n - \tilde q_n \in \mathrm{span}\{u_0,\dots,u_{n-1}\}$, we can represent $q_n - \tilde q_n$ as a linear
combination of $q_0,\dots,q_{n-1}$; i.e.,

$$q_n - \tilde q_n = \sum_{k=0}^{n-1} \alpha_k q_k.$$

Now the orthogonality yields

$$\|q_n - \tilde q_n\|^2 = \Bigl(q_n - \tilde q_n, \sum_{k=0}^{n-1} \alpha_k q_k\Bigr) = 0,$$

whence $q_n = \tilde q_n$. □
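The recursion used in the existence part of the proof translates almost directly into code. The following is a minimal sketch for real vectors with the Euclidean scalar product; the helper names `dot` and `gram_schmidt` and the sample vectors are illustrative assumptions, not part of the text.

```python
def dot(x, y):
    """Euclidean scalar product of two real vectors."""
    return sum(a * b for a, b in zip(x, y))


def gram_schmidt(us):
    """Orthogonalize linearly independent real vectors u_0, u_1, ...

    Implements q_n = u_n - sum_k (u_n, q_k) / (q_k, q_k) * q_k.
    The resulting q_n are orthogonal but not normalized.
    """
    qs = []
    for u in us:
        q = list(u)
        for p in qs:
            coeff = dot(u, p) / dot(p, p)
            q = [qi - coeff * pi for qi, pi in zip(q, p)]
        qs.append(q)
    return qs


# Hypothetical sample data: three linearly independent vectors in R^3.
us = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
qs = gram_schmidt(us)
```

In floating-point arithmetic this classical form can lose orthogonality for nearly dependent vectors; the sketch is meant only to mirror the formula in the proof.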



3.3 Bounded Linear Operators 

By the symbol $A : X \to Y$ we will denote a mapping whose domain of
definition is a set $X$ and whose range is contained in a set $Y$; i.e., to every
$x \in X$ the mapping $A$ assigns a unique element $Ax \in Y$. The range is the
set $A(X) := \{Ax : x \in X\}$ of all image elements. We will use the terms
mapping, function, and operator synonymously. (We have already used this
convention in Definitions 3.1 and 3.12.)

Definition 3.19 An operator $A$ mapping a subset $U$ of a normed space $X$
into a normed space $Y$ is called continuous at $x \in U$ if for every sequence
$(x_n)$ from $U$ with $\lim_{n\to\infty} x_n = x$ we have $\lim_{n\to\infty} Ax_n = Ax$. The function
$A : U \to Y$ is called continuous if it is continuous for all $x \in U$.

An equivalent definition is the following: A function $A : U \subset X \to Y$
is continuous at $x \in U$ if for every $\varepsilon > 0$ there exists $\delta > 0$ such that
$\|Ax - Ay\| < \varepsilon$ for all $y \in U$ with $\|x - y\| < \delta$. Here we have used the same
symbol $\|\cdot\|$ for the norms on $X$ and $Y$. Note that by the second triangle
inequality of Remark 3.3 the norm is a continuous function.

Definition 3.20 An operator $A : X \to Y$ mapping a linear space $X$ into
a linear space $Y$ is called linear if

$$A(\alpha x + \beta y) = \alpha Ax + \beta Ay$$

for all $x, y \in X$ and all $\alpha, \beta \in \mathbb{C}$ (or $\mathbb{R}$).

Theorem 3.21 A linear operator is continuous if it is continuous at one 
element. 
Proof. Let $A : X \to Y$ be continuous at $x_0 \in X$. Then for every $x \in X$ and
every sequence $(x_n)$ with $x_n \to x$, $n \to \infty$, we have

$$Ax_n = A(x_n - x + x_0) + A(x - x_0) \to A(x_0) + A(x - x_0) = A(x), \quad n \to \infty,$$

since $x_n - x + x_0 \to x_0$, $n \to \infty$. □

Definition 3.22 A linear operator $A : X \to Y$ from a normed space $X$
into a normed space $Y$ is called bounded if there exists a positive number
$C$ such that

$$\|Ax\| \le C\|x\|$$

for all $x \in X$. Each number $C$ for which this inequality holds is called a
bound for the operator $A$. (Again we have used the same symbol $\|\cdot\|$ for
the norms on $X$ and $Y$.)

Theorem 3.23 A linear operator $A : X \to Y$ is bounded if and only if

$$\|A\| := \sup_{\|x\|=1} \|Ax\| < \infty.$$

The number $\|A\|$ is the smallest bound for $A$ and is called the norm of $A$.

Proof. Assume that $A$ is bounded with the bound $C$. Then

$$\sup_{\|x\|=1} \|Ax\| \le C,$$

and, in particular, $\|A\|$ is less than or equal to any bound for $A$. Conversely,
if $\|A\| < \infty$, then using the linearity of $A$ and the homogeneity of the norm,
we find that

$$\|Ax\| = \Bigl\|A\Bigl(\frac{x}{\|x\|}\Bigr)\Bigr\|\,\|x\| \le \|A\|\,\|x\|$$

for all $x \ne 0$. Therefore, $A$ is bounded with the bound $\|A\|$. □

Theorem 3.24 A linear operator is continuous if and only if it is bounded. 

Proof. Let $A : X \to Y$ be bounded and let $(x_n)$ be a sequence in $X$ with
$x_n \to 0$, $n \to \infty$. Then from $\|Ax_n\| \le C\|x_n\|$ it follows that $Ax_n \to 0$,
$n \to \infty$. Thus, $A$ is continuous at $x = 0$, and because of Theorem 3.21 it is
continuous everywhere in $X$.

Conversely, let $A$ be continuous and assume that there is no $C > 0$ such
that $\|Ax\| \le C\|x\|$ for all $x \in X$. Then there exists a sequence $(x_n)$ in $X$
with $\|x_n\| = 1$ and $\|Ax_n\| \ge n$. Consider the sequence $y_n := x_n/\|Ax_n\|$.
Then $y_n \to 0$, $n \to \infty$, and since $A$ is continuous, $Ay_n \to A(0) = 0$, $n \to \infty$.
This is a contradiction to $\|Ay_n\| = 1$ for all $n$. Hence, $A$ is bounded. □

Remark 3.25 Let $X$, $Y$, and $Z$ be normed spaces and let $A : X \to Y$ and
$B : Y \to Z$ be bounded linear operators. Then the product $BA : X \to Z$,
defined by $(BA)x := B(Ax)$ for all $x \in X$, is a bounded linear operator
with $\|BA\| \le \|A\|\,\|B\|$.

Proof. This follows from $\|(BA)x\| = \|B(Ax)\| \le \|B\|\,\|A\|\,\|x\|$. □
3.4 Matrix Norms

Theorem 3.26 Let $(a_{jk})$ be a real or complex $n \times n$ matrix. Then the
linear operators $A : \mathbb{R}^n \to \mathbb{R}^n$ and $A : \mathbb{C}^n \to \mathbb{C}^n$, defined by

$$(Ax)_j := \sum_{k=1}^{n} a_{jk} x_k, \quad j = 1,\dots,n,$$

are bounded with respect to each norm on $\mathbb{R}^n$ and $\mathbb{C}^n$. In particular, we
have

$$\|A\|_1 = \max_{k=1,\dots,n} \sum_{j=1}^{n} |a_{jk}|, \qquad (3.5)$$

$$\|A\|_\infty = \max_{j=1,\dots,n} \sum_{k=1}^{n} |a_{jk}|, \qquad (3.6)$$

$$\|A\|_2 \le \Bigl(\sum_{j,k=1}^{n} |a_{jk}|^2\Bigr)^{1/2}. \qquad (3.7)$$

In this case the norms are also called matrix norms. (Note that in (3.5)-(3.7)
both the domain and the range are given the same norm.)



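Proof. By Theorem 3.8 it suffices to prove boundedness of $A$ with respect
to one norm. For $\|\cdot\|_1$ we can estimate

$$\|Ax\|_1 = \sum_{j=1}^{n} |(Ax)_j| = \sum_{j=1}^{n} \Bigl|\sum_{k=1}^{n} a_{jk} x_k\Bigr| \le \sum_{k=1}^{n} |x_k| \sum_{j=1}^{n} |a_{jk}| \le \Bigl(\max_{k=1,\dots,n} \sum_{j=1}^{n} |a_{jk}|\Bigr) \sum_{k=1}^{n} |x_k|.$$

Therefore, we have that

$$\|A\|_1 \le \max_{k=1,\dots,n} \sum_{j=1}^{n} |a_{jk}|. \qquad (3.8)$$

Now choose $i$ such that

$$\sum_{j=1}^{n} |a_{ji}| = \max_{k=1,\dots,n} \sum_{j=1}^{n} |a_{jk}|$$

and choose $z \in \mathbb{R}^n$ with $z_i = 1$ and $z_k = 0$ for $k \ne i$. Then $\|z\|_1 = 1$ and

$$\|Az\|_1 = \sum_{j=1}^{n} \Bigl|\sum_{k=1}^{n} a_{jk} z_k\Bigr| = \sum_{j=1}^{n} |a_{ji}| = \max_{k=1,\dots,n} \sum_{j=1}^{n} |a_{jk}|.$$

Hence

$$\|A\|_1 = \sup_{\|x\|_1=1} \|Ax\|_1 \ge \|Az\|_1 = \max_{k=1,\dots,n} \sum_{j=1}^{n} |a_{jk}|, \qquad (3.9)$$

and from (3.8) and (3.9) we obtain (3.5).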

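For $\|\cdot\|_\infty$ we can estimate

$$\|Ax\|_\infty = \max_{j=1,\dots,n} |(Ax)_j| = \max_{j=1,\dots,n} \Bigl|\sum_{k=1}^{n} a_{jk} x_k\Bigr| \le \max_{k=1,\dots,n} |x_k| \, \max_{j=1,\dots,n} \sum_{k=1}^{n} |a_{jk}|.$$

Therefore, we have that

$$\|A\|_\infty \le \max_{j=1,\dots,n} \sum_{k=1}^{n} |a_{jk}|. \qquad (3.10)$$

Now choose $i$ such that

$$\sum_{k=1}^{n} |a_{ik}| = \max_{j=1,\dots,n} \sum_{k=1}^{n} |a_{jk}|,$$

and choose $z \in \mathbb{C}^n$ with $z_k = \bar a_{ik}/|a_{ik}|$ if $a_{ik} \ne 0$ and $z_k = 1$ if $a_{ik} = 0$.
Then $\|z\|_\infty = 1$ and

$$\|Az\|_\infty = \max_{j=1,\dots,n} \Bigl|\sum_{k=1}^{n} a_{jk} z_k\Bigr| \ge \Bigl|\sum_{k=1}^{n} a_{ik} z_k\Bigr| = \sum_{k=1}^{n} |a_{ik}| = \max_{j=1,\dots,n} \sum_{k=1}^{n} |a_{jk}|.$$

Hence

$$\|A\|_\infty = \sup_{\|x\|_\infty=1} \|Ax\|_\infty \ge \|Az\|_\infty = \max_{j=1,\dots,n} \sum_{k=1}^{n} |a_{jk}|, \qquad (3.11)$$

and from (3.10) and (3.11) we obtain (3.6).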

Finally, for $\|\cdot\|_2$, using the Cauchy-Schwarz inequality we can estimate

$$\|Ax\|_2^2 = \sum_{j=1}^{n} |(Ax)_j|^2 = \sum_{j=1}^{n} \Bigl|\sum_{k=1}^{n} a_{jk} x_k\Bigr|^2 \le \sum_{j=1}^{n} \Bigl\{\sum_{k=1}^{n} |a_{jk}|^2 \sum_{k=1}^{n} |x_k|^2\Bigr\} = \sum_{j,k=1}^{n} |a_{jk}|^2 \sum_{k=1}^{n} |x_k|^2.$$

Therefore,

$$\|A\|_2^2 \le \sum_{j,k=1}^{n} |a_{jk}|^2,$$

and (3.7) is proven. In this inequality equality does not hold, in general, as
can be seen by considering the identity matrix. □
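The three formulas of Theorem 3.26 are easy to evaluate directly. The following sketch, with matrices stored as lists of rows, computes the column-sum norm (3.5), the row-sum norm (3.6), and the bound on the right-hand side of (3.7); the function names and the sample matrix are assumptions made for illustration.

```python
import math


def norm_1(a):
    """Maximum absolute column sum, formula (3.5)."""
    n = len(a)
    return max(sum(abs(a[j][k]) for j in range(n)) for k in range(n))


def norm_inf(a):
    """Maximum absolute row sum, formula (3.6)."""
    return max(sum(abs(v) for v in row) for row in a)


def two_norm_bound(a):
    """Right-hand side of (3.7): an upper bound for the 2-norm."""
    return math.sqrt(sum(abs(v) ** 2 for row in a for v in row))


# Hypothetical sample matrix.
A = [[1.0, -2.0], [3.0, 4.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
```

For the 2 x 2 identity matrix the bound evaluates to the square root of 2, while the 2-norm of the identity is 1, which illustrates that (3.7) is not sharp in general.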

In order to derive a representation for $\|A\|_2$ we need to recall the definition
and some basic facts about eigenvalues and eigenvectors of a matrix.
A number $\lambda \in \mathbb{C}$ is called an eigenvalue of the matrix $A$ if there exists a
vector $x \in \mathbb{C}^n$ with $x \ne 0$ such that

$$Ax = \lambda x.$$

The vector $x$ is called an eigenvector for the eigenvalue $\lambda$. Each $n \times n$
matrix has at least one and at most $n$ eigenvalues, since the characteristic
polynomial $\det(A - \lambda I)$ has at least one and at most $n$ zeros. Eigenvectors
for different eigenvalues are linearly independent (see Problem 3.12). The
algebraic multiplicity of an eigenvalue of a matrix is its multiplicity as a zero
of the characteristic polynomial; its geometric multiplicity is the number of
linearly independent eigenvectors associated with the eigenvalue.

Theorem 3.27 To each matrix $A$ there exists a unitary matrix $Q$ such
that $Q^*AQ$ is an upper triangular matrix.

Proof. We proceed by induction on $n$; for $n = 1$ there is nothing to prove.
Assume that it has been shown that for each $(n-1) \times (n-1)$
matrix $A_{n-1}$ there exists a unitary $(n-1) \times (n-1)$ matrix $Q_{n-1}$ such that
$Q_{n-1}^* A_{n-1} Q_{n-1}$ is an upper triangular matrix. Let $\lambda$ be an eigenvalue of the
$n \times n$ matrix $A_n$ with eigenvector $u$. We may assume that $(u, u) = 1$, where
$(\cdot\,,\cdot)$ is the Euclidean scalar product. Using the Gram-Schmidt procedure
of Theorem 3.18 we can construct an orthonormal basis of $\mathbb{C}^n$ of the form
$u, v_2, \dots, v_n$. Then we define a unitary $n \times n$ matrix by

$$U_n := (u, v_2, \dots, v_n).$$

With the aid of $(u, v_j) = 0$, $j = 2,\dots,n$, we see that

$$U_n^* A_n U_n = U_n^* (\lambda u, A_n v_2, \dots, A_n v_n) = \begin{pmatrix} \lambda & * \\ 0 & A_{n-1} \end{pmatrix}$$

with some $(n-1) \times (n-1)$ matrix $A_{n-1}$. By the induction assumption there
exists a unitary $(n-1) \times (n-1)$ matrix $Q_{n-1}$ such that $Q_{n-1}^* A_{n-1} Q_{n-1}$
is upper triangular. Then

$$Q_n := U_n \begin{pmatrix} 1 & 0 \\ 0 & Q_{n-1} \end{pmatrix}$$

defines a unitary $n \times n$ matrix, and $Q_n^* A_n Q_n$ is upper triangular. □
Lemma 3.28 For an $n \times n$ matrix $A$ and its adjoint $A^*$ we have that

$$(Ax, y) = (x, A^*y)$$

for all $x, y \in \mathbb{C}^n$, where $(\cdot\,,\cdot)$ denotes the Euclidean scalar product.

Proof. Simple calculations yield

$$(Ax, y) = \sum_{j=1}^{n} (Ax)_j \bar y_j = \sum_{j=1}^{n} \sum_{k=1}^{n} a_{jk} x_k \bar y_j = \sum_{k=1}^{n} \sum_{j=1}^{n} x_k \overline{a^*_{kj} y_j} = \sum_{k=1}^{n} x_k \overline{(A^*y)_k} = (x, A^*y),$$

where we have used that $a^*_{kj} = \overline{a_{jk}}$. □

Theorem 3.29 The eigenvalues of a Hermitian $n \times n$ matrix are real, and
the eigenvectors form an orthogonal basis in $\mathbb{C}^n$.

Proof. If $A$ is Hermitian, i.e., if $A = A^*$, then the matrix $\tilde A := Q^*AQ$ from
Theorem 3.27 is also Hermitian, since

$$\tilde A^* = (Q^*AQ)^* = Q^*A^*Q^{**} = Q^*AQ = \tilde A.$$

Therefore, in this case the upper triangular matrix $\tilde A$ must be diagonal;
i.e.,

$$\tilde A = D := \mathrm{diag}(\lambda_1, \dots, \lambda_n).$$

Since from $Q^*AQ = D$ it follows that $AQ = QD$, we can conclude that
the columns of $Q = (u_1, \dots, u_n)$ satisfy $Au_j = \lambda_j u_j$, $j = 1,\dots,n$. Hence
the eigenvectors of a Hermitian matrix form an orthogonal basis in $\mathbb{C}^n$.
Because of

$$\lambda_j = (Au_j, u_j) = (u_j, Au_j) = \overline{(Au_j, u_j)} = \bar\lambda_j,$$

the eigenvalues of Hermitian matrices are real. □

For a positive semidefinite matrix $A$, i.e., for a Hermitian matrix with
the property

$$(Ax, x) \ge 0, \quad x \in \mathbb{C}^n,$$

all eigenvalues are real and nonnegative. Analogously, the eigenvalues of a
positive definite matrix $A$, i.e., of a Hermitian matrix with the property

$$(Ax, x) > 0, \quad x \in \mathbb{C}^n, \ x \ne 0,$$

are positive.
Definition 3.30 The number

$$\rho(A) := \max\{|\lambda| : \lambda \text{ eigenvalue of } A\}$$

is called the spectral radius of $A$.

Theorem 3.31 For an $n \times n$ matrix $A$ we have

$$\|A\|_2 = \sqrt{\rho(A^*A)}.$$

If $A$ is Hermitian, then

$$\|A\|_2 = \rho(A).$$

Proof. From Lemma 3.28 we have that

$$\|Ax\|_2^2 = (Ax, Ax) = (x, A^*Ax)$$

for all $x \in \mathbb{C}^n$. Hence the Hermitian matrix $A^*A$ is positive semidefinite
and therefore has $n$ orthonormal eigenvectors

$$A^*Au_j = \mu_j^2 u_j, \quad j = 1,\dots,n,$$

with real nonnegative eigenvalues. We use the orthonormal basis of eigenvectors
and represent $x \in \mathbb{C}^n$ by

$$x = \sum_{j=1}^{n} \alpha_j u_j$$

and have

$$\|x\|_2^2 = (x, x) = \Bigl(\sum_{j=1}^{n} \alpha_j u_j, \sum_{k=1}^{n} \alpha_k u_k\Bigr) = \sum_{j=1}^{n} |\alpha_j|^2$$

and

$$\|Ax\|_2^2 = (Ax, Ax) = (x, A^*Ax) = \Bigl(\sum_{j=1}^{n} \alpha_j u_j, \sum_{k=1}^{n} \mu_k^2 \alpha_k u_k\Bigr) = \sum_{j=1}^{n} \mu_j^2 |\alpha_j|^2.$$

From this we obtain that

$$\|Ax\|_2^2 \le \rho(A^*A)\,\|x\|_2^2,$$

whence

$$\|A\|_2^2 \le \rho(A^*A)$$

follows. On the other hand, if we choose $j$ such that $\mu_j^2 = \rho(A^*A)$, then we
have that

$$\|A\|_2^2 = \Bigl[\sup_{\|x\|_2=1} \|Ax\|_2\Bigr]^2 \ge \|Au_j\|_2^2 = (u_j, A^*Au_j) = \mu_j^2 = \rho(A^*A).$$

This concludes the proof of $\|A\|_2 = \sqrt{\rho(A^*A)}$. If $A$ is Hermitian, then
$A^*A = A^2$, whence $\rho(A^*A) = \rho(A^2) = [\rho(A)]^2$ follows. □
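For a real 2 x 2 matrix the representation of Theorem 3.31 can be evaluated in closed form, since the eigenvalues of the symmetric matrix $A^T A$ are the roots of a quadratic. The sketch below does exactly this; the function name and the sample matrix are illustrative assumptions, and the approach is not meant as a general-purpose algorithm.

```python
import math


def spectral_norm_2x2(a):
    """2-norm of a real 2x2 matrix computed as sqrt(rho(A^T A))."""
    (a11, a12), (a21, a22) = a
    # Entries of the symmetric matrix A^T A = [[p, s], [s, r]].
    p = a11 * a11 + a21 * a21
    r = a12 * a12 + a22 * a22
    s = a11 * a12 + a21 * a22
    # Largest root of t^2 - (p + r) t + (p r - s^2) = 0.
    largest = 0.5 * (p + r + math.sqrt((p - r) ** 2 + 4.0 * s * s))
    return math.sqrt(largest)


# Hypothetical symmetric sample matrix with eigenvalues 1 and 3.
sym = [[2.0, 1.0], [1.0, 2.0]]
norm_sym = spectral_norm_2x2(sym)
```

Because the sample matrix is symmetric, the computed norm coincides with its spectral radius, in agreement with the second statement of the theorem.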



The following final theorem of this section is of basic importance for 
establishing a necessary and sufficient condition for the convergence of it- 
erative methods for linear systems. 

Theorem 3.32 For each norm on $\mathbb{C}^n$ and each $n \times n$ matrix $A$ we have
that

$$\rho(A) \le \|A\|.$$

Conversely, to each matrix $A$ and each $\varepsilon > 0$ there exists a norm on $\mathbb{C}^n$
such that

$$\|A\| \le \rho(A) + \varepsilon.$$

Proof. Let $\lambda$ be an eigenvalue of $A$ with eigenvector $u$. We may assume that
$\|u\| = 1$. Then the first part of the theorem follows from

$$\|A\| = \sup_{\|x\|=1} \|Ax\| \ge \|Au\| = \|\lambda u\| = |\lambda|.$$

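For the second part, by Theorem 3.27 there exists a unitary matrix $Q$
such that

$$B = Q^*AQ = \begin{pmatrix} b_{11} & b_{12} & b_{13} & \cdots & b_{1n} \\ & b_{22} & b_{23} & \cdots & b_{2n} \\ & & b_{33} & \cdots & b_{3n} \\ & & & \ddots & \vdots \\ & & & & b_{nn} \end{pmatrix}$$

is upper triangular. Because of $\det(\lambda I - A) = \det(\lambda I - B)$, the eigenvalues
of $A$ are given by $\lambda_j = b_{jj}$, $j = 1,\dots,n$. We set

$$b := \max_{1 \le j < k \le n} |b_{jk}|$$

and

$$\delta := \min\Bigl(1, \frac{\varepsilon}{(n-1)\,b}\Bigr)$$

and define the diagonal matrix

$$D := \mathrm{diag}(1, \delta, \delta^2, \dots, \delta^{n-1})$$

with the inverse

$$D^{-1} = \mathrm{diag}(1, \delta^{-1}, \delta^{-2}, \dots, \delta^{-n+1}).$$

Then for $C := D^{-1}BD$ we have that

$$C = \begin{pmatrix} b_{11} & \delta b_{12} & \delta^2 b_{13} & \cdots & \delta^{n-1} b_{1n} \\ & b_{22} & \delta b_{23} & \cdots & \delta^{n-2} b_{2n} \\ & & b_{33} & \cdots & \delta^{n-3} b_{3n} \\ & & & \ddots & \vdots \\ & & & & b_{nn} \end{pmatrix}.$$

Since $\delta \le 1$, by Theorem 3.26, we can estimate

$$\|C\|_\infty \le \max_{j=1,\dots,n} |b_{jj}| + (n-1)\,\delta b \le \rho(A) + \varepsilon.$$

After setting $V := QD$ we define a norm on $\mathbb{C}^n$ by $\|x\| := \|V^{-1}x\|_\infty$. Using
$C = V^{-1}AV$ we now obtain

$$\|Ax\| = \|V^{-1}Ax\|_\infty = \|CV^{-1}x\|_\infty \le \|C\|_\infty \|V^{-1}x\|_\infty = \|C\|_\infty \|x\|$$

for all $x \in \mathbb{C}^n$. Hence

$$\|A\| \le \|C\|_\infty \le \rho(A) + \varepsilon,$$

and the proof is finished. □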
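The effect of the diagonal scaling in this proof can be observed numerically. The sketch below assumes a matrix that is already upper triangular, so that $Q = I$ and $V = D$; the function name and the sample values are hypothetical.

```python
def scaled_row_sum_norm(b, delta):
    """Row-sum norm of C = D^{-1} B D with D = diag(1, delta, delta^2, ...)."""
    n = len(b)
    # Entry (j, k) of C equals delta**(k - j) * b[j][k].
    return max(
        sum(abs(b[j][k]) * delta ** (k - j) for k in range(n))
        for j in range(n)
    )


# Hypothetical upper triangular matrix with spectral radius 0.5.
B = [[0.5, 10.0], [0.0, 0.5]]
eps = 0.01
n = len(B)
b_max = max(abs(B[j][k]) for j in range(n) for k in range(j + 1, n))
delta = min(1.0, eps / ((n - 1) * b_max))

plain = scaled_row_sum_norm(B, 1.0)     # ordinary row-sum norm of B
scaled = scaled_row_sum_norm(B, delta)  # norm induced by the scaled basis
```

With these values the ordinary row-sum norm is far larger than the spectral radius, while the scaled norm stays within the prescribed tolerance of it.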



3.5 Completeness 

Definition 3.33 A sequence $(x_n)$ of elements in a normed space $X$ is
called a Cauchy sequence if for every $\varepsilon > 0$ there exists an integer $N(\varepsilon)$
such that

$$\|x_n - x_m\| < \varepsilon$$

for all $n, m \ge N(\varepsilon)$, i.e., if $\lim_{n,m\to\infty} \|x_n - x_m\| = 0$.

Theorem 3.34 Every convergent sequence is a Cauchy sequence.

Proof. Let $x_n \to x$, $n \to \infty$. Then, for $\varepsilon > 0$ there exists $N(\varepsilon) \in \mathbb{N}$ such
that $\|x_n - x\| < \varepsilon/2$ for all $n \ge N(\varepsilon)$. Now the triangle inequality yields

$$\|x_n - x_m\| = \|x_n - x + x - x_m\| \le \|x_n - x\| + \|x - x_m\| < \varepsilon$$

for all $n, m \ge N(\varepsilon)$. □

The fact that the converse of Theorem 3.34 is not true in general gives 
rise to the following definition. 

Definition 3.35 A subset U of a normed space X is called complete if 
every Cauchy sequence of elements in U converges to an element in U . A 
normed space is called a Banach space if it is complete. A pre-Hilbert space 
is called a Hilbert space if it is complete. 

The subset of rational numbers is not complete in $\mathbb{R}$. In order to give
further examples, we introduce some infinite-dimensional normed spaces.

The set $C[a,b]$ of continuous functions $f : [a,b] \to \mathbb{R}$ equipped with
pointwise addition and scalar multiplication,

$$(f + g)(x) := f(x) + g(x), \quad (\alpha f)(x) := \alpha f(x),$$

obviously is a linear space. Since the monomials $x \mapsto x^n$, $n = 0, 1, \dots$, are
linearly independent (see Theorem 8.2), $C[a,b]$ has infinite dimension.
Example 3.36 The linear space $C[a,b]$ furnished with the maximum norm

$$\|f\|_\infty := \max_{x \in [a,b]} |f(x)|$$

is a Banach space.

Proof. The norm axioms (N1)-(N3) are trivially satisfied. The triangle
inequality follows from

$$\|f + g\|_\infty = \max_{x \in [a,b]} |(f + g)(x)| = |(f + g)(x_0)| \le |f(x_0)| + |g(x_0)| \le \max_{x \in [a,b]} |f(x)| + \max_{x \in [a,b]} |g(x)| = \|f\|_\infty + \|g\|_\infty$$

for some $x_0 \in [a,b]$. Since the condition $\|f_n - f\|_\infty < \varepsilon$ is equivalent to
$|f_n(x) - f(x)| < \varepsilon$ for all $x \in [a,b]$, convergence of a sequence of continuous
functions in the maximum norm is equivalent to uniform convergence on
$[a,b]$. Since the Cauchy criterion is sufficient for uniform convergence of a
sequence of continuous functions to a continuous limit function, the space
$C[a,b]$ is complete with respect to the maximum norm. □

Example 3.37 The linear space $C[a,b]$ equipped with the $L_1$ norm

$$\|f\|_1 := \int_a^b |f(x)|\,dx$$

is not complete.

Proof. The norm axioms are trivially satisfied. Without loss of generality
we take $[a,b] = [0,2]$ and choose

$$f_n(x) := \begin{cases} x^n, & 0 \le x \le 1, \\ 1, & 1 \le x \le 2. \end{cases}$$

Then for $m > n$ we have that

$$\|f_n - f_m\|_1 = \int_0^1 (x^n - x^m)\,dx \le \frac{1}{n+1} \to 0, \quad n \to \infty,$$

and therefore $(f_n)$ is a Cauchy sequence. Now we assume that $(f_n)$ converges
with respect to the $L_1$ norm to a continuous function $f$; i.e.,

$$\|f_n - f\|_1 \to 0, \quad n \to \infty.$$

Then

$$\int_0^1 |f(x)|\,dx \le \int_0^1 |f(x) - x^n|\,dx + \int_0^1 x^n\,dx \le \|f - f_n\|_1 + \frac{1}{n+1} \to 0$$

for $n \to \infty$, whence $f(x) = 0$ follows for $0 \le x \le 1$. Furthermore, we have

$$\int_1^2 |f(x) - 1|\,dx = \int_1^2 |f(x) - f_n(x)|\,dx \le \|f - f_n\|_1 \to 0, \quad n \to \infty.$$

This implies that $f(x) = 1$ for $1 \le x \le 2$, and we have a contradiction,
since $f$ is continuous.

However, we note that the space $L^1[a,b]$ of measurable and Lebesgue
integrable real-valued functions is complete with respect to the $L_1$ norm
(see [5, 51, 59]). □
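The Cauchy property of the sequence $(f_n)$ can be checked numerically. The sketch below approximates $\|f_n - f_m\|_1$ by the midpoint rule and compares it with the exact value $1/(n+1) - 1/(m+1)$; the function name and the step count are assumptions chosen for illustration.

```python
def l1_distance(n, m, steps=2000):
    """Midpoint-rule approximation of the L1 distance of f_n and f_m on [0, 2].

    Both functions equal 1 on [1, 2], so only [0, 1] contributes.
    """
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += abs(x ** n - x ** m)
    return h * total


approx = l1_distance(5, 10)
exact = 1.0 / 6.0 - 1.0 / 11.0  # integral of x^5 - x^10 over [0, 1]
```

The distances shrink like $1/(n+1)$, although the pointwise limit of the $f_n$ is discontinuous at $x = 1$.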



Example 3.38 The linear space $C[a,b]$ equipped with the $L_2$ norm

$$\|f\|_2 := \Bigl(\int_a^b |f(x)|^2\,dx\Bigr)^{1/2}$$

is not complete.

Proof. The norm is generated by the scalar product

$$(f, g) := \int_a^b f(x)\,g(x)\,dx.$$

Considering the same sequence as in Example 3.37, it can be seen that
$C[a,b]$ also is not complete with respect to the $L_2$ norm. Again note that
the space $L^2[a,b]$ of measurable and Lebesgue square-integrable real-valued
functions is complete with respect to the $L_2$ norm (see [5, 51, 59]). □

Theorem 3.39 Each finite-dimensional normed space is a Banach space.

Proof. Let $X$ be finite-dimensional with basis $u_1, \dots, u_n$ and assume that
$(x_\nu)$ is a Cauchy sequence in $X$. We represent

$$x_\nu = \sum_{j=1}^{n} \alpha_{j\nu} u_j$$

and recall from Theorem 3.8 that there exists $C > 0$ such that

$$\max_{j=1,\dots,n} |\alpha_{j\nu} - \alpha_{j\mu}| \le C\|x_\nu - x_\mu\|$$

for all $\nu, \mu \in \mathbb{N}$. Hence for $j = 1,\dots,n$ the $(\alpha_{j\nu})$ are Cauchy sequences
in $\mathbb{C}$. Therefore, there exist $\alpha_1, \dots, \alpha_n$ such that $\alpha_{j\nu} \to \alpha_j$, $\nu \to \infty$, for
$j = 1,\dots,n$, since the Cauchy criterion is sufficient for convergence in $\mathbb{C}$.
Then we have convergence,

$$x_\nu \to x := \sum_{j=1}^{n} \alpha_j u_j \in X, \quad \nu \to \infty,$$

and the proof is finished. □
Remark 3.40 Complete sets are closed, and each closed subset of a
complete subset is complete.

Proof. This is trivial. □ 



3.6 The Banach Fixed Point Theorem 

Definition 3.41 Let $U$ be a subset of a normed space $X$. An operator
$A : U \to X$ is called a contraction operator if there exists a constant
$q \in [0, 1)$ such that

$$\|Ax - Ay\| \le q\|x - y\|$$

for all $x, y \in U$. Each constant $q$ satisfying this inequality is called a
contraction number of the operator $A$.

Frequently, we will call a contraction operator simply a contraction. 

Remark 3.42 Each contraction operator is continuous. 

Proof. This is trivial, since the convergence $\|x_n - x\| \to 0$, $n \to \infty$, implies
that $\|Ax_n - Ax\| \le q\|x_n - x\| \to 0$, $n \to \infty$. □

An operator $A : U \to X$ is called Lipschitz continuous with Lipschitz
constant $L$ if there exists a positive constant $L$ such that

$$\|Ax - Ay\| \le L\|x - y\|$$

for all $x, y \in U$. Thus, contraction operators are Lipschitz continuous
operators with Lipschitz constant less than one.

Definition 3.43 An element $x$ of a normed space $X$ is called a fixed point
of an operator $A : U \subset X \to X$ if

$$Ax = x.$$

Theorem 3.44 Each contraction operator has at most one fixed point. 

Proof. Assume that $x$ and $y$ are two different fixed points of the contraction
operator $A$. Then

$$0 \ne \|x - y\| = \|Ax - Ay\| \le q\|x - y\|,$$

whence $1 \le q$ follows. This is a contradiction to the fact that $A$ is a
contraction operator. □

Theorem 3.45 (Banach) Let $U$ be a complete subset of a normed space
$X$ and let $A : U \to U$ be a contraction operator. Then $A$ has a unique fixed
point.

Proof. Starting from an arbitrary element $x_0 \in U$ we define a sequence $(x_n)$
in $U$ by the recursion

$$x_{n+1} := Ax_n, \quad n = 0, 1, 2, \dots.$$

Then we have

$$\|x_{n+1} - x_n\| = \|Ax_n - Ax_{n-1}\| \le q\|x_n - x_{n-1}\|,$$

and from this we deduce by induction that

$$\|x_{n+1} - x_n\| \le q^n \|x_1 - x_0\|, \quad n = 1, 2, \dots.$$

Hence, for $m > n$, by the triangle inequality and the geometric series it
follows that

$$\|x_n - x_m\| \le \|x_n - x_{n+1}\| + \|x_{n+1} - x_{n+2}\| + \cdots + \|x_{m-1} - x_m\| \le (q^n + q^{n+1} + \cdots + q^{m-1})\,\|x_1 - x_0\| \le \frac{q^n}{1-q}\,\|x_1 - x_0\|. \qquad (3.12)$$

Since $q^n \to 0$, $n \to \infty$, this implies that $(x_n)$ is a Cauchy sequence, and
therefore because $U$ is complete there exists an element $x \in U$ such that
$x_n \to x$, $n \to \infty$. Finally, the continuity of $A$ from Remark 3.42 yields

$$x = \lim_{n\to\infty} x_{n+1} = \lim_{n\to\infty} Ax_n = Ax;$$

i.e., $x$ is a fixed point of $A$. That this fixed point is unique we have already
settled by Theorem 3.44. □

The main importance of Banach’s fixed point theorem in numerical anal- 
ysis originates from its constructive proof. Besides establishing existence of 
a fixed point by the method of successive approximations, it also provides 
an algorithm for obtaining numerical approximations. And this algorithm 
is very easy to program because of its iterative nature. We explicitly state 
this in the following theorem. 

Theorem 3.46 Let $A$ be a contraction operator with contraction constant
$q$ mapping a complete subset $U$ of a normed space $X$ into itself. Then the
successive approximations

$$x_{n+1} := Ax_n, \quad n = 0, 1, 2, \dots,$$

with arbitrary $x_0 \in U$ converge to the unique fixed point $x$ of $A$. We have
the a priori error estimate

$$\|x_n - x\| \le \frac{q^n}{1-q}\,\|x_1 - x_0\|$$

and the a posteriori error estimate

$$\|x_n - x\| \le \frac{q}{1-q}\,\|x_n - x_{n-1}\|$$

for all $n \in \mathbb{N}$.

Proof. The a priori error estimate follows from (3.12) by passing to the
limit $m \to \infty$. The a posteriori estimate follows from the a priori estimate
applied with starting element $x_0 = x_{n-1}$. □

The a priori estimate is used in order to obtain upper bounds on the
number of iteration steps that are necessary to achieve a desired accuracy.
In order to guarantee that

$$\|x_n - x\| \le \varepsilon$$

for a given accuracy $\varepsilon$, by the a priori estimate we need

$$n \ge \frac{\ln \tilde\varepsilon}{\ln q}$$

iterations, where $\tilde\varepsilon = (1-q)\,\varepsilon/\|x_1 - x_0\|$. The smaller the contraction
constant $q$, the fewer iteration steps are required. The a posteriori estimate,
which in general yields better estimates as compared with the a priori
estimate, is used to check the accuracy during the computation and terminate
the iterations when the required accuracy is reached.
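As an illustration of how the two estimates are used in practice, the following sketch applies successive approximations to the cosine on $U = [0, 1]$, where $q = \sin 1$ is a contraction number by the mean value theorem. The function name, the tolerance, and the choice of example are assumptions made here and are not taken from the text.

```python
import math


def fixed_point(A, x0, q, tol, max_iter=10000):
    """Iterate x_{n+1} = A(x_n) until the a posteriori bound is below tol."""
    x_prev = x0
    for n in range(1, max_iter + 1):
        x = A(x_prev)
        # A posteriori estimate: |x_n - x| <= q / (1 - q) * |x_n - x_{n-1}|.
        if q / (1.0 - q) * abs(x - x_prev) <= tol:
            return x, n
        x_prev = x
    raise RuntimeError("tolerance not reached within max_iter steps")


q = math.sin(1.0)   # contraction number of cos on [0, 1]
x0, tol = 0.0, 1e-8
x, steps = fixed_point(math.cos, x0, q, tol)

# A priori bound on the number of steps: n >= ln(eps~) / ln(q).
eps_tilde = (1.0 - q) * tol / abs(math.cos(x0) - x0)
n_a_priori = math.ceil(math.log(eps_tilde) / math.log(q))
```

The a posteriori criterion stops no later than the a priori step count, because the a posteriori bound at step $n$ is never larger than the a priori bound at the same step.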

The property

$$\|Ax - Ay\| < \|x - y\|$$

for all $x, y$ with $x \ne y$, which is weaker than the contraction property, is not
sufficient in general to ensure the existence of a fixed point, as illustrated
in the following example (see also Problem 3.18).

Example 3.47 The function $f : [0, \infty) \to [0, \infty)$ given by

$$f(x) := x + \frac{1}{1+x}$$

as a consequence of

$$f(x) - f(y) = \frac{x + y + xy}{1 + x + y + xy}\,(x - y)$$

fulfills the condition

$$|f(x) - f(y)| < |x - y|$$

for $x \ne y$. However, because of

$$f(x) - x = \frac{1}{1+x} > 0$$

for all $x \ge 0$, it does not have a fixed point. □
We conclude this section by considering the special case of linear operators,
i.e., by considering the Neumann series (see Problem 3.16).

Let $A : X \to Y$ be an operator mapping a set $X$ into a set $Y$. If for
each $y \in A(X)$ there is only one element $x \in X$ with $Ax = y$, then $A$ is
said to be injective and to have an inverse $A^{-1} : A(X) \to X$ defined by
$A^{-1}y := x$. The inverse mapping satisfies $A^{-1}A = I$ on $X$ and $AA^{-1} = I$
on $A(X)$, where $I$ denotes the identity operator mapping each element into
itself. If $A(X) = Y$, then the mapping is said to be surjective. The mapping
is called bijective if it is injective and surjective, i.e., if the inverse mapping
$A^{-1} : Y \to X$ exists.

Theorem 3.48 Let $B : X \to X$ be a bounded linear operator on a Banach
space $X$ with $\|B\| < 1$, and let $I : X \to X$ denote the identity operator.
Then $I - B$ is bijective; i.e., for each $z \in X$ the equation

$$x - Bx = z$$

has a unique solution $x \in X$. The successive approximations

$$x_{n+1} := Bx_n + z, \quad n = 0, 1, 2, \dots,$$

with arbitrary $x_0 \in X$ converge to this solution, and we have the a priori
estimate

$$\|x_n - x\| \le \frac{\|B\|^n}{1 - \|B\|}\,\|x_1 - x_0\|$$

and the a posteriori estimate

$$\|x_n - x\| \le \frac{\|B\|}{1 - \|B\|}\,\|x_n - x_{n-1}\|$$

for all $n \in \mathbb{N}$. Furthermore, the inverse operator $(I - B)^{-1}$ is bounded by

$$\|(I - B)^{-1}\| \le \frac{1}{1 - \|B\|}.$$

Proof. For fixed, but arbitrary, $z \in X$ we define the operator $A : X \to X$
by

$$Ax := Bx + z, \quad x \in X.$$

Then we have

$$\|Ax - Ay\| = \|B(x - y)\| \le \|B\|\,\|x - y\|$$

for all $x, y \in X$; i.e., $A$ is a contraction with contraction number $q = \|B\|$.
Now the statements of the theorem can be deduced from Theorem 3.46.
With the starting element $x_0 = z$ the successive approximations lead to

$$x_n = \sum_{k=0}^{n} B^k z$$

with the iterated operators $B^k : X \to X$ defined recursively by $B^0 := I$
and $B^k := BB^{k-1}$ for $k \in \mathbb{N}$. Hence, in view of Remark 3.25, we have

$$\|x_n\| \le \sum_{k=0}^{n} \|B^k z\| \le \sum_{k=0}^{n} \|B\|^k \|z\| \le \frac{\|z\|}{1 - \|B\|},$$

and therefore, since $x_n \to (I - B)^{-1}z$, $n \to \infty$, it follows that

$$\|(I - B)^{-1}z\| \le \frac{\|z\|}{1 - \|B\|}$$

for all $z \in X$. □
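For a small matrix $B$ the successive approximations of Theorem 3.48 can be carried out directly. The sketch below uses a hypothetical 2 x 2 matrix whose row-sum norm is 0.7 and starts from $x_0 = z$, so that the iterates are the partial sums of the Neumann series; all names and values are illustrative assumptions.

```python
def mat_vec(b, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(bjk * xk for bjk, xk in zip(row, x)) for row in b]


def neumann_solve(b, z, steps):
    """Approximate the solution of x - Bx = z by x_{n+1} = B x_n + z, x_0 = z."""
    x = list(z)
    for _ in range(steps):
        x = [bx + zj for bx, zj in zip(mat_vec(b, x), z)]
    return x


# Hypothetical data: the row-sum norm of B is 0.7 < 1.
B = [[0.2, 0.1], [0.3, 0.4]]
z = [1.0, 2.0]
x = neumann_solve(B, z, 200)
residual = [xj - bx - zj for xj, bx, zj in zip(x, mat_vec(B, x), z)]
```

In the maximum norm the computed solution also respects the bound $\|x\|_\infty \le \|z\|_\infty/(1 - \|B\|_\infty)$ from the theorem.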



3.7 Best Approximation 

Definition 3.49 Let $U \subset X$ be a subset of a normed space $X$ and let
$w \in X$. An element $v \in U$ is called a best approximation to $w$ with respect
to $U$ if

$$\|w - v\| = \inf_{u \in U} \|w - u\|,$$

i.e., if $v \in U$ has smallest distance from $w$.

Theorem 3.50 Let $U$ be a finite-dimensional subspace of a normed space
$X$. Then for every element in $X$ there exists a best approximation with
respect to $U$.

Proof. Let $w \in X$ and choose a minimizing sequence $(u_n)$ for $w$; i.e., $u_n \in U$
satisfies

$$\|w - u_n\| \to d := \inf_{u \in U} \|w - u\|, \quad n \to \infty.$$

Because of $\|u_n\| \le \|w - u_n\| + \|w\|$ the sequence $(u_n)$ is bounded. By
Theorem 3.11 the sequence $(u_n)$ contains a convergent subsequence $(u_{n(\ell)})$
with limit $v \in U$. Then

$$\|w - v\| = \lim_{\ell\to\infty} \|w - u_{n(\ell)}\| = d$$

completes the proof. □

Theorem 3.51 Let $U$ be a linear subspace of a pre-Hilbert space $X$. An
element $v$ is a best approximation to $w \in X$ with respect to $U$ if and only
if

$$(w - v, u) = 0 \qquad (3.13)$$

for all $u \in U$, i.e., if and only if $w - v \perp U$. To each $w \in X$ there exists at
most one best approximation with respect to $U$.

Proof. We begin by noting the equality

$$\|w - u\|^2 = \|w - v\|^2 + 2\,\mathrm{Re}(w - v, v - u) + \|v - u\|^2, \qquad (3.14)$$

which is valid for all $u, v \in U$. From this, sufficiency of the condition (3.13)
is obvious, since $U$ is a linear subspace.

To establish the necessity we assume that $v$ is a best approximation and
$(w - v, u_0) \ne 0$ for some $u_0 \in U$. Then, since $U$ is a linear subspace, we
may assume that $(w - v, u_0) \in \mathbb{R}$. Choosing

$$u = v + \frac{(w - v, u_0)}{\|u_0\|^2}\,u_0,$$

from (3.14) we arrive at

$$\|w - u\|^2 = \|w - v\|^2 - \frac{(w - v, u_0)^2}{\|u_0\|^2} < \|w - v\|^2,$$

which contradicts the fact that $v$ is a best approximation of $w$.

Finally, assume that $v_1$ and $v_2$ are best approximations. Then from (3.13)
it follows that $(w - v_1, v_1 - v_2) = 0 = (w - v_2, v_1 - v_2)$. This implies
$(v_1 - v_2, v_1 - v_2) = 0$, whence $v_1 = v_2$ follows. □

Theorem 3.52 Let $U$ be a complete linear subspace of a pre-Hilbert space
$X$. Then to each element $w \in X$ there exists a unique best approximation
with respect to $U$. The operator $P : X \to U$ mapping $w \in X$ onto its best
approximation is a bounded linear operator with the properties

$$P^2 = P \quad \text{and} \quad \|P\| = 1.$$

It is called the orthogonal projection from $X$ onto $U$.

Proof. Choose a sequence $(u_n)$ with

$$\|w - u_n\|^2 \le d^2 + \frac{1}{n}, \quad n \in \mathbb{N}, \qquad (3.15)$$

where $d := \inf_{u \in U} \|w - u\|$. Then

$$\|(w - u_n) + (w - u_m)\|^2 + \|u_n - u_m\|^2 = 2\|w - u_n\|^2 + 2\|w - u_m\|^2 \le 4d^2 + \frac{2}{n} + \frac{2}{m}$$

for all $n, m \in \mathbb{N}$, and since $\frac{1}{2}(u_n + u_m) \in U$, it follows that

$$\|u_n - u_m\|^2 \le 4d^2 + \frac{2}{n} + \frac{2}{m} - 4\,\Bigl\|w - \frac{1}{2}(u_n + u_m)\Bigr\|^2 \le \frac{2}{n} + \frac{2}{m}.$$

Hence, $(u_n)$ is a Cauchy sequence, and since $U$ is complete, there exists an
element $v \in U$ such that $u_n \to v$, $n \to \infty$. Passing to the limit $n \to \infty$
in (3.15) shows that $v$ is a best approximation of $w$ with respect to $U$.
Uniqueness of the best approximation follows from Theorem 3.51.

Trivially, we have $Pu = u$ for all $u \in U$, and this implies $P^2 = P$. From
(3.13) it can be deduced that $P$ is a linear operator and that

$$\|w\|^2 = \|Pw\|^2 + \|w - Pw\|^2 \ge \|Pw\|^2$$

for all $w \in X$. Therefore, $P$ is bounded with $\|P\| \le 1$. From Remark 3.25
and $P^2 = P$ it follows that $\|P\| \ge 1$, which concludes the proof. □

Corollary 3.53 Let $U$ be a finite-dimensional linear subspace of a
pre-Hilbert space $X$ with basis $u_1, \dots, u_n$. The linear combination

$$v = \sum_{k=1}^{n} \alpha_k u_k$$

is the best approximation to $w \in X$ with respect to $U$ if and only if the
coefficients $\alpha_1, \dots, \alpha_n$ satisfy the normal equations

$$\sum_{k=1}^{n} \alpha_k (u_k, u_j) = (w, u_j), \quad j = 1,\dots,n. \qquad (3.16)$$

Proof. The normal equations (3.16) obviously are equivalent to (3.13). □

The normal equations for the best approximation in pre-Hilbert spaces
provide further examples of systems of linear equations. The solution
becomes trivial if the basis $u_1, \dots, u_n$ is orthonormal.

Corollary 3.54 Let $U$ be a finite-dimensional linear subspace of a
pre-Hilbert space $X$ with orthonormal basis $u_1, \dots, u_n$. Then the orthogonal
projection operator is given by

$$Pw = \sum_{k=1}^{n} (w, u_k)\,u_k, \quad w \in X.$$

Proof. This is trivial from either the orthogonality condition of Theorem
3.51 or the normal equations of Corollary 3.53. □
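For a two-dimensional subspace of $\mathbb{R}^3$ the normal equations (3.16) form a 2 x 2 linear system that can be solved by Cramer's rule. The sketch below does this for a basis that is not orthogonal and then checks the orthogonality condition (3.13); the function names and the sample vectors are assumptions chosen for illustration.

```python
def dot(x, y):
    """Euclidean scalar product of two real vectors."""
    return sum(a * b for a, b in zip(x, y))


def best_approximation(u1, u2, w):
    """Best approximation to w from span{u1, u2} via the normal equations."""
    g11, g12, g22 = dot(u1, u1), dot(u1, u2), dot(u2, u2)
    r1, r2 = dot(w, u1), dot(w, u2)
    det = g11 * g22 - g12 * g12   # nonzero for linearly independent u1, u2
    a1 = (r1 * g22 - g12 * r2) / det
    a2 = (g11 * r2 - g12 * r1) / det
    return [a1 * p + a2 * q for p, q in zip(u1, u2)]


# Hypothetical data: a non-orthogonal basis of the x-y plane in R^3.
u1, u2 = [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]
w = [2.0, 3.0, 4.0]
v = best_approximation(u1, u2, w)
residual = [wi - vi for wi, vi in zip(w, v)]
```

Here the best approximation is the projection of $w$ onto the plane, and the residual $w - v$ is orthogonal to both basis vectors.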

Problems 

3.1 Show that (3.1) defines a norm on $\mathbb{C}^n$ for $p \ge 1$ and that

$$\lim_{p\to\infty} \|x\|_p = \|x\|_\infty$$

for all $x \in \mathbb{C}^n$.

3.2 Indicate the closed balls $\{x \in \mathbb{R}^2 : \|x\|_p \le 1\}$ for $p = 1, 2, \infty$. What
properties do they have in common?

3.3 Show that (3.1) does not define a norm on $\mathbb{C}^n$ for $0 < p < 1$.

3.4 For the $\ell_1$ and $\ell_\infty$ norms on $\mathbb{C}^n$ show that $\|x\|_\infty \le \|x\|_1 \le n\|x\|_\infty$.

3.5 Let $X$ and $Y$ be normed spaces with norms $\|\cdot\|_X$ and $\|\cdot\|_Y$, respectively.
Show that

$$\|(x,y)\| := \|x\|_X + \|y\|_Y,$$

$$\|(x,y)\| := \bigl(\|x\|_X^2 + \|y\|_Y^2\bigr)^{1/2},$$

$$\|(x,y)\| := \max(\|x\|_X, \|y\|_Y),$$

for $(x, y) \in X \times Y$ define norms on the product $X \times Y$.

3.6 Show that convergent sequences are bounded. 

3.7 Let $(x_n)$ be a sequence of elements of a normed space $X$. The series

$$\sum_{k=1}^{\infty} x_k$$

is called convergent if the sequence $(S_n)$ of partial sums

$$S_n := \sum_{k=1}^{n} x_k$$

converges. The limit $S = \lim_{n\to\infty} S_n$ is called the sum of the series. Show that
in a Banach space $X$ the convergence of the series

$$\sum_{k=1}^{\infty} \|x_k\|$$

is a sufficient condition for the convergence of the series $\sum_{k=1}^{\infty} x_k$ and

$$\Bigl\|\sum_{k=1}^{\infty} x_k\Bigr\| \le \sum_{k=1}^{\infty} \|x_k\|.$$

3.8 A norm $\|\cdot\|_a$ on a linear space $X$ is called stronger than a norm $\|\cdot\|_b$ if every
sequence converging with respect to the norm $\|\cdot\|_a$ also converges with respect
to the norm $\|\cdot\|_b$. Show that $\|\cdot\|_a$ is stronger than $\|\cdot\|_b$ if and only if there
exists a positive number $C$ such that $\|x\|_b \le C\|x\|_a$ for all $x \in X$. Show that on
$C[a,b]$ the maximum norm is stronger than the $L_2$ norm (and stronger than the
$L_1$ norm). Construct a counterexample to demonstrate that the maximum norm
and the $L_2$ norm (and the maximum norm and the $L_1$ norm) are not equivalent.

3.9 Show that in a normed space the operations of addition and multiplication
by a scalar are continuous functions. Show that in a pre-Hilbert space the scalar
product is a continuous function.
3.10 Show that a norm $\|\cdot\|$ on a linear space $X$ is generated by a
scalar product if and only if the parallelogram equality

$$\|x + y\|^2 + \|x - y\|^2 = 2\left( \|x\|^2 + \|y\|^2 \right)$$

holds for all $x, y \in X$. Show that the $\ell_1$ and $\ell_\infty$ norms on
$\mathbb{C}^n$ are not generated by scalar products.

3.11 Let $A$ be a positive definite $n \times n$ matrix and denote by
$(\cdot\,, \cdot)$ the Euclidean scalar product on $\mathbb{C}^n$. Show that
$(Ax, y)$ defines a scalar product on $\mathbb{C}^n$.

3.12 Show that eigenvectors of a matrix for different eigenvalues are linearly 
independent. 

3.13 Let $X$ and $Y$ be normed spaces and denote by $L(X,Y)$ the linear space
of all bounded linear operators $A : X \to Y$. Show that $L(X,Y)$ equipped with

$$\|A\| := \sup_{\|x\| = 1} \|Ax\|$$

again is a normed space and that $L(X,Y)$ is a Banach space if $Y$ is a Banach
space.

3.14 Let $A : X \to X$ denote an operator from a normed space $X$ into itself.
The iterated operators $A^n : X \to X$, $n = 0, 1, \ldots$, are defined
recursively by $A^0 = I$ and $A^n := A A^{n-1}$ for $n \in \mathbb{N}$. If $A$
is bounded and linear, show that

$$\|A^n\| \leq \|A\|^n.$$

3.15 Show that for $n \times n$ matrices $A$ the series

$$\sum_{k=0}^{\infty} \frac{1}{k!} A^k$$

converges (with respect to any norm on $\mathbb{C}^n$), and denote the sum of
the series by $e^A$. Show that if $\lambda$ is an eigenvalue of $A$, then
$e^\lambda$ is an eigenvalue of $e^A$.

3.16 Show that if $B : X \to X$ is a linear operator on a Banach space $X$ with
$\|B\| < 1$, then the Neumann series

$$\sum_{k=0}^{\infty} B^k = (I - B)^{-1}$$

converges in the Banach space $L(X,X)$.

3.17 Let $U$ be a complete subset of a normed space $X$ and let $A : U \to U$
be a continuous operator, and assume that $A^m$ is a contraction for some
$m \in \mathbb{N}$. Show that $A$ has a unique fixed point and that the
successive approximations $x_{n+1} := A x_n$, $n = 0, 1, \ldots$, with
arbitrary $x_0 \in U$ converge to this fixed point.

3.18 A subset $U$ of a normed space $X$ is called sequentially compact if each
sequence from $U$ contains a convergent subsequence with limit in $U$. Let $U$
be a complete and sequentially compact subset of a normed space $X$ and let
$A : U \to U$ be an operator with the property

$$\|Ax - Ay\| < \|x - y\|$$

for all $x, y \in U$ with $x \neq y$. Show that $A$ has a unique fixed point
and that the successive approximations $x_{n+1} := A x_n$, $n = 0, 1, \ldots$,
with arbitrary $x_0 \in U$ converge to this fixed point.

3.19 Let $\{u_n : n \in \mathbb{N}\}$ be an orthonormal system in a pre-Hilbert
space $X$. Show that the following properties are equivalent:

(a) $\mathrm{span}\{u_n : n \in \mathbb{N}\}$ is dense in $X$.

(b) Each $\varphi \in X$ can be expanded in a Fourier series

$$\varphi = \sum_{n=1}^{\infty} (\varphi, u_n)\, u_n.$$

(c) For each $\varphi \in X$ we have Parseval's equality

$$\|\varphi\|^2 = \sum_{n=1}^{\infty} |(\varphi, u_n)|^2.$$

Show that properties (a)-(c) imply that

(d) $x = 0$ is the only element in $X$ with $(x, u_n) = 0$ for all
$n \in \mathbb{N}$,

and that (a), (b), (c), and (d) are equivalent if $X$ is a Hilbert space.

3.20 Show that the best approximation to a function $f \in C[0, 2\pi]$ in the
$L_2$ norm with respect to the space of trigonometric polynomials of degree at
most $n$ is given by the partial sum

$$(P_n f)(x) = \frac{a_0}{2} + \sum_{k=1}^{n} a_k \cos kx + \sum_{k=1}^{n} b_k \sin kx, \quad x \in [0, 2\pi],$$

of the Fourier series of $f$ with the Fourier coefficients

$$a_k = \frac{1}{\pi} \int_0^{2\pi} f(x) \cos kx \, dx, \quad b_k = \frac{1}{\pi} \int_0^{2\pi} f(x) \sin kx \, dx.$$




4 

Iterative Methods for Linear Systems 



This chapter is devoted to applying the analysis developed in the previous 
chapter to the iterative solution of systems of linear equations. In particular, 
we will discuss in detail the Jacobi and the Gauss-Seidel iterations, which 
essentially go back to Gauss. In Supplementum Theoriae Combinationis 
Observationum Erroribus Minime Obnoxia , published in 1822, Gauss used 
a variant of the Gauss-Seidel method for the solution of the linear systems 
arising through his least squares method, since they were too large for 
elimination methods. 

With the advent of computers the size of the linear systems that could 
be solved grew enormously, leading to the requirement of speedup of the 
convergence of the classical Jacobi and Gauss-Seidel iterations. In this 
context, we will introduce the reader to the idea of relaxation methods, 
including a typical example that illustrates the dramatic gain in the speed 
of convergence by overrelaxation. We will conclude the section with the idea 
of defect correction iteration and indicate its application to the very efficient 
solution of the large linear systems arising from the discretization of linear 
differential and integral equations by two-grid and multigrid methods.



4.1 Jacobi and Gauss-Seidel Iterations 

We start by supplementing the sufficient condition of Theorem 3.48 for 
convergence of the method of successive approximations by establishing a 
necessary and sufficient condition for the finite-dimensional case. 
Theorem 4.1 Let $B$ be an $n \times n$ matrix. Then the successive
approximations

$$x_{\nu+1} := B x_\nu + z, \quad \nu = 0, 1, 2, \ldots,$$

converge for each $z \in \mathbb{C}^n$ and each $x_0 \in \mathbb{C}^n$ if and
only if

$$\rho(B) < 1$$

for the spectral radius of $B$.

Proof. If $\rho(B) < 1$, then by Theorem 3.32 there exists a norm $\|\cdot\|$
on $\mathbb{C}^n$ such that $\|B\| < 1$. Now convergence follows from Theorem
3.48 together with the equivalence of all norms on $\mathbb{C}^n$ according to
Theorem 3.8.

Conversely, suppose that convergence holds. If we assume that
$\rho(B) \geq 1$, then there exists an eigenvalue $\lambda$ of $B$ with
$|\lambda| \geq 1$. Let $x$ denote an associated eigenvector. Then the
successive iterations for the right-hand side $z = x$ and the starting element
$x_0 = x$ lead to the divergent sequence
$x_\nu = \left( \sum_{k=0}^{\nu} \lambda^k \right) x$. This is a
contradiction. □
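
The criterion of Theorem 4.1 can be tried out numerically. The following Python sketch is only an illustration under stated assumptions: the helper names `spectral_radius_2x2` and `iterate` and the sample matrix are chosen here and do not come from the text. The matrix has row-sum norm 2.5 but spectral radius 1/2, so the successive approximations should still converge.

```python
import cmath


def spectral_radius_2x2(B):
    """Spectral radius of a 2x2 matrix from its characteristic polynomial."""
    (a, b), (c, d) = B
    trace, det = a + d, a * d - b * c
    disc = cmath.sqrt(trace * trace - 4 * det)
    return max(abs((trace + disc) / 2), abs((trace - disc) / 2))


def iterate(B, z, x0, steps):
    """Run the successive approximations x_{v+1} = B x_v + z."""
    x = list(x0)
    n = len(x)
    for _ in range(steps):
        x = [sum(B[j][k] * x[k] for k in range(n)) + z[j] for j in range(n)]
    return x


# Hypothetical sample data: the row-sum norm is 2.5, the spectral radius 0.5.
B = [[0.5, 2.0], [0.0, 0.5]]
z = [1.0, 1.0]
x = iterate(B, z, [0.0, 0.0], 200)   # fixed point of x = Bx + z is (10, 2)
```

The example also shows why the theorem is stated in terms of the spectral radius: a particular norm of $B$ may exceed one even though the iteration converges.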



We note that Theorem 4.1 remains valid for bounded linear operators 
B : X — > X in infinite-dimensional Banach spaces with the definition of 
the spectral radius appropriately modified. However, the proof requires a 
different and deeper analysis. 

For the iterative solution of a system of linear equations of the form 

Ax = y 

we distinguish different methods by the way in which the original system 
is transformed into an equivalent fixed-point form. We decompose A by 

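$$A = D + A_L + A_R$$

into a diagonal matrix

$$D = \mathrm{diag}(a_{11}, \ldots, a_{nn}),$$

a proper lower (left) triangular matrix

$$A_L = \begin{pmatrix} 0 & & & & \\ a_{21} & 0 & & & \\ a_{31} & a_{32} & 0 & & \\ \vdots & \vdots & & \ddots & \\ a_{n1} & a_{n2} & \cdots & a_{n,n-1} & 0 \end{pmatrix},$$

and a proper upper (right) triangular matrix

$$A_R = \begin{pmatrix} 0 & a_{12} & a_{13} & \cdots & a_{1n} \\ & 0 & a_{23} & \cdots & a_{2n} \\ & & \ddots & & \vdots \\ & & & 0 & a_{n-1,n} \\ & & & & 0 \end{pmatrix}.$$

We assume that all the diagonal entries of $A$ are different from zero. Hence
the inverse $D^{-1}$ of $D$ exists.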

In the method attributed to Jacobi, which is sometimes also called the
method of simultaneous displacements, the system $Ax = y$ is transformed
into the equivalent form

$$x = -D^{-1}(A_L + A_R)x + D^{-1}y,$$

and the latter is solved by successive approximations

$$x_{\nu+1} := -D^{-1}(A_L + A_R)x_\nu + D^{-1}y, \quad \nu = 0, 1, 2, \ldots,$$

with arbitrarily chosen starting element $x_0$. Written in components, one
step of the Jacobi iteration scheme reads

$$x_{\nu+1,j} = -\sum_{\substack{k=1 \\ k \neq j}}^{n} \frac{a_{jk}}{a_{jj}}\, x_{\nu,k} + \frac{y_j}{a_{jj}}, \quad j = 1, \ldots, n.$$
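
The component form of the Jacobi step translates almost literally into code. The following is a minimal Python sketch, assuming dense matrices stored as lists of rows; the function names `jacobi_step` and `jacobi` and the 3 x 3 test system are hypothetical choices made for illustration.

```python
def jacobi_step(A, y, x):
    """One Jacobi step: every new component uses only the old vector x."""
    n = len(A)
    return [
        (y[j] - sum(A[j][k] * x[k] for k in range(n) if k != j)) / A[j][j]
        for j in range(n)
    ]


def jacobi(A, y, x0, steps):
    """Apply a fixed number of Jacobi steps starting from x0."""
    x = list(x0)
    for _ in range(steps):
        x = jacobi_step(A, y, x)
    return x


# Hypothetical strictly row-diagonally dominant test system with solution (1, 1, 1).
A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
y = [5.0, 6.0, 5.0]
x = jacobi(A, y, [0.0, 0.0, 0.0], 60)
```

Because `jacobi_step` builds a new list from the old vector, all components are updated simultaneously, which is the property that gives the method its alternative name.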



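Theorem 4.2 Assume that the matrix $A = (a_{jk})$ satisfies

$$q_\infty := \max_{j=1,\ldots,n} \sum_{\substack{k=1 \\ k \neq j}}^{n} \left| \frac{a_{jk}}{a_{jj}} \right| < 1 \tag{4.1}$$

or

$$q_1 := \max_{k=1,\ldots,n} \sum_{\substack{j=1 \\ j \neq k}}^{n} \left| \frac{a_{jk}}{a_{jj}} \right| < 1 \tag{4.2}$$

or

$$q_2 := \left( \sum_{\substack{j,k=1 \\ j \neq k}}^{n} \left| \frac{a_{jk}}{a_{jj}} \right|^2 \right)^{1/2} < 1. \tag{4.3}$$

Then the Jacobi method, or method of simultaneous displacements,

$$x_{\nu+1,j} = -\sum_{\substack{k=1 \\ k \neq j}}^{n} \frac{a_{jk}}{a_{jj}}\, x_{\nu,k} + \frac{y_j}{a_{jj}}, \quad j = 1, \ldots, n, \quad \nu = 0, 1, 2, \ldots,$$

converges for each $y \in \mathbb{C}^n$ and each $x_0 \in \mathbb{C}^n$ to the
unique solution of $Ax = y$ (in any norm on $\mathbb{C}^n$). For
$\mu = 1, 2, \infty$, if $q_\mu < 1$, we have the a priori error estimate

$$\|x_\nu - x\|_\mu \leq \frac{q_\mu^\nu}{1 - q_\mu}\, \|x_1 - x_0\|_\mu$$

and the a posteriori error estimate

$$\|x_\nu - x\|_\mu \leq \frac{q_\mu}{1 - q_\mu}\, \|x_\nu - x_{\nu-1}\|_\mu$$

for all $\nu \in \mathbb{N}$.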

Proof. The Jacobi matrix $-D^{-1}(A_L + A_R)$ has diagonal entries zero and
off-diagonal entries $-a_{jk}/a_{jj}$. Hence by Theorem 3.26 we have

$$\| -D^{-1}(A_L + A_R) \|_\infty = q_\infty,$$

$$\| -D^{-1}(A_L + A_R) \|_1 = q_1,$$

$$\| -D^{-1}(A_L + A_R) \|_2 \leq q_2.$$

Now the assertion follows from Theorem 3.48. □
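
The three quantities in (4.1)-(4.3) and the a priori estimate can be evaluated directly. The Python sketch below is an illustration only; the names `jacobi_constants` and `jacobi_step` and the 3 x 3 test matrix are assumptions made for this example. It compares the actual maximum-norm error after ten Jacobi steps with the a priori bound.

```python
import math


def jacobi_constants(A):
    """Return (q_inf, q_1, q_2) as defined in conditions (4.1)-(4.3)."""
    n = len(A)
    ratio = [
        [abs(A[j][k] / A[j][j]) if k != j else 0.0 for k in range(n)]
        for j in range(n)
    ]
    q_inf = max(sum(row) for row in ratio)                            # row sums
    q_1 = max(sum(ratio[j][k] for j in range(n)) for k in range(n))   # column sums
    q_2 = math.sqrt(sum(r * r for row in ratio for r in row))
    return q_inf, q_1, q_2


def jacobi_step(A, y, x):
    """One Jacobi step in component form."""
    n = len(A)
    return [
        (y[j] - sum(A[j][k] * x[k] for k in range(n) if k != j)) / A[j][j]
        for j in range(n)
    ]


# Hypothetical test system with exact solution (1, 1, 1).
A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
y = [5.0, 6.0, 5.0]
q_inf, q_1, q_2 = jacobi_constants(A)

x0 = [0.0, 0.0, 0.0]
x1 = jacobi_step(A, y, x0)
first_step = max(abs(a - b) for a, b in zip(x1, x0))   # ||x_1 - x_0||_inf

nu = 10
x = x0
for _ in range(nu):
    x = jacobi_step(A, y, x)
error = max(abs(xi - 1.0) for xi in x)                  # ||x_nu - x||_inf
a_priori_bound = q_inf ** nu / (1.0 - q_inf) * first_step
```

For this matrix the bound is pessimistic, since the estimate depends only on the norm of the Jacobi matrix and not on its spectral radius.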



Note that the sufficient convergence conditions (4.1)-(4.3) are not
equivalent. Roughly speaking, each criterion ensures convergence if the
diagonal entries of $A$ are dominant. The condition (4.1) can also be written
as

$$\sum_{\substack{k=1 \\ k \neq j}}^{n} |a_{jk}| < |a_{jj}|, \quad j = 1, \ldots, n; \tag{4.4}$$

i.e., the matrix $A$ is required to be strictly row-diagonally dominant. From
(4.2) it can be deduced (see Problem 4.4) that if

$$\sum_{\substack{j=1 \\ j \neq k}}^{n} |a_{jk}| < |a_{kk}|, \quad k = 1, \ldots, n, \tag{4.5}$$

i.e., if the matrix $A$ is strictly column-diagonally dominant, then the
Jacobi iterations converge.

For the Gauss-Seidel method, which is also known as the method of
successive displacements, we proceed differently and transform $Ax = y$ via

$$(D + A_L)x = -A_R x + y$$

into the equivalent form

$$x = -(D + A_L)^{-1} A_R x + (D + A_L)^{-1} y,$$

which is then solved by the successive approximations

$$x_{\nu+1} := -(D + A_L)^{-1} A_R x_\nu + (D + A_L)^{-1} y, \quad \nu = 0, 1, 2, \ldots,$$

with arbitrarily chosen starting element $x_0$. For the actual computations
we rewrite this as

$$(D + A_L)x_{\nu+1} = -A_R x_\nu + y, \quad \nu = 0, 1, 2, \ldots,$$

and solve the linear system for $x_{\nu+1}$ with the lower triangular matrix
$D + A_L$ by forward substitution. This leads to the Gauss-Seidel iteration
scheme in the following explicit form:

$$x_{\nu+1,j} = -\sum_{k=1}^{j-1} \frac{a_{jk}}{a_{jj}}\, x_{\nu+1,k} - \sum_{k=j+1}^{n} \frac{a_{jk}}{a_{jj}}\, x_{\nu,k} + \frac{y_j}{a_{jj}}, \quad j = 1, \ldots, n.$$

Here and in the sequel empty sums have to be interpreted as zero.

In the Jacobi iteration scheme all the components of the new approximation
vector $x_{\nu+1}$ are obtained by using only the components of the
previous approximation vector $x_\nu$, which explains why this method is also
called the method of simultaneous displacements. However, in the Gauss-Seidel
iterations each new component of $x_{\nu+1}$ is immediately used in the
computation of the next component; i.e., for computing the $j$th component
the values $x_{\nu+1,1}, x_{\nu+1,2}, \ldots, x_{\nu+1,j-1}$ are already used.
This is very convenient for computer calculations, since the new values can be
stored in the locations held by the old values, which reduces the storage
requirements.
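
The in-place update described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a dense matrix stored as a list of rows; the name `gauss_seidel` and the test data (a 5 x 5 tridiagonal matrix with entries -1, 2, -1 and solution (1, ..., 1)) are assumptions made for the example.

```python
def gauss_seidel(A, y, x0, steps):
    """Gauss-Seidel sweeps that overwrite x in place.

    When component j is updated, x[0..j-1] already hold the new values and
    x[j+1..n-1] still hold the old ones, exactly as in the explicit scheme.
    """
    n = len(A)
    x = list(x0)
    for _ in range(steps):
        for j in range(n):
            s = sum(A[j][k] * x[k] for k in range(n) if k != j)
            x[j] = (y[j] - s) / A[j][j]
    return x


# Hypothetical test problem: tridiagonal matrix with stencil (-1, 2, -1).
n = 5
A = [
    [2.0 if j == k else (-1.0 if abs(j - k) == 1 else 0.0) for k in range(n)]
    for j in range(n)
]
y = [1.0, 0.0, 0.0, 0.0, 1.0]          # equals A applied to (1, ..., 1)
x = gauss_seidel(A, y, [0.0] * n, 200)
```

Only one vector is stored, in contrast to a Jacobi implementation, which has to keep the old and the new approximation at the same time.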



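Theorem 4.3 Assume that the matrix $A = (a_{jk})$ fulfills the Sassenfeld
criterion

$$p := \max_{j=1,\ldots,n} p_j < 1,$$

where the numbers $p_j$ are recursively defined by

$$p_1 := \sum_{k=2}^{n} \left| \frac{a_{1k}}{a_{11}} \right|, \qquad p_j := \sum_{k=1}^{j-1} \left| \frac{a_{jk}}{a_{jj}} \right| p_k + \sum_{k=j+1}^{n} \left| \frac{a_{jk}}{a_{jj}} \right|, \quad j = 2, \ldots, n.$$

Then the Gauss-Seidel method, or method of successive displacements,

$$x_{\nu+1,j} = -\sum_{k=1}^{j-1} \frac{a_{jk}}{a_{jj}}\, x_{\nu+1,k} - \sum_{k=j+1}^{n} \frac{a_{jk}}{a_{jj}}\, x_{\nu,k} + \frac{y_j}{a_{jj}}, \quad j = 1, \ldots, n, \quad \nu = 0, 1, 2, \ldots,$$

converges for each $y \in \mathbb{C}^n$ and each $x_0 \in \mathbb{C}^n$ to the
unique solution of $Ax = y$ (in any norm on $\mathbb{C}^n$). We have the a
priori error estimate

$$\|x_\nu - x\|_\infty \leq \frac{p^\nu}{1 - p}\, \|x_1 - x_0\|_\infty$$

and the a posteriori error estimate

$$\|x_\nu - x\|_\infty \leq \frac{p}{1 - p}\, \|x_\nu - x_{\nu-1}\|_\infty$$

for all $\nu \in \mathbb{N}$.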
Proof. Consider the equation

$$(D + A_L)x = -A_R z$$

for $z \in \mathbb{C}^n$ with $\|z\|_\infty = 1$, that is,

$$x_j = -\sum_{k=1}^{j-1} \frac{a_{jk}}{a_{jj}}\, x_k - \sum_{k=j+1}^{n} \frac{a_{jk}}{a_{jj}}\, z_k, \quad j = 1, \ldots, n.$$

By induction, this implies that $|x_j| \leq p_j$ for $j = 1, \ldots, n$, and
therefore $\|x\|_\infty \leq p$. Hence we have

$$\|(D + A_L)^{-1} A_R\|_\infty \leq p,$$

and the assertion of the theorem follows from Theorem 3.48. □

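Corollary 4.4 Assume that the matrix $A$ is strictly row-diagonally dominant.
Then the Gauss-Seidel iterations converge.

Example 4.5 The tridiagonal matrix

$$A = \begin{pmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 2 \end{pmatrix}$$

from Example 2.1 is not strictly row-diagonally dominant, but it satisfies
the Sassenfeld criterion.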

Proof. Obviously, $q_\infty = 1$; i.e., (4.1) is not fulfilled. We have the
recursion

$$p_1 = \frac{1}{2}, \qquad p_j = \frac{1}{2}\, p_{j-1} + \frac{1}{2}, \quad j = 2, \ldots, n-1, \qquad p_n = \frac{1}{2}\, p_{n-1}.$$

From this, by induction, it follows that

$$p_j = 1 - \frac{1}{2^j}, \quad j = 1, \ldots, n-1.$$

Therefore,

$$p = 1 - \frac{1}{2^{n-1}} < 1,$$

and this implies convergence of the Gauss-Seidel iterations by Theorem
4.3. □
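
The recursion for the Sassenfeld numbers is easy to evaluate by machine. The following Python sketch is an illustration under stated assumptions: the function name `sassenfeld_numbers` is chosen here, and the matrix is the tridiagonal matrix with entries -1, 2, -1 for the sample size n = 6.

```python
def sassenfeld_numbers(A):
    """Return the list p_1, ..., p_n from the recursive Sassenfeld definition."""
    n = len(A)
    p = []
    for j in range(n):
        # Entries left of the diagonal are weighted by the already computed p_k.
        lower = sum(abs(A[j][k] / A[j][j]) * p[k] for k in range(j))
        upper = sum(abs(A[j][k] / A[j][j]) for k in range(j + 1, n))
        p.append(lower + upper)
    return p


n = 6
A = [
    [2.0 if j == k else (-1.0 if abs(j - k) == 1 else 0.0) for k in range(n)]
    for j in range(n)
]
p = sassenfeld_numbers(A)
contraction_number = max(p)      # compare with 1 - 1/2**(n-1)
```

The computed maximum agrees with the closed form $1 - 2^{-(n-1)}$, which makes visible how quickly the contraction number approaches one as $n$ grows.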
Since the matrix $A$ is tridiagonal, the system $Ax = y$ can be solved
efficiently by elimination (see Problem 2.9). Nevertheless, this matrix pro- 
vides a very suitable example for the analysis of iterative methods for lin- 
ear systems arising in the discretization of ordinary and partial differential 
equations. This is due to the fact that in more general cases, for exam- 
ple for the linear system of Example 2.2, there are more technical details 
to consider, which distract from the basic principles. However, these basic 
principles do not depend on the dimension of the underlying differential 
equation problem. 

In Example 4.5, if n is large, the contraction number p will be close to 
one, i.e., the convergence rate of the Gauss-Seidel iterations will be unsat- 
isfactorily slow. Before we indicate how the convergence can be accelerated, 
we continue by discussing a weaker form of row-diagonal dominance. 

Definition 4.6 An $n \times n$ matrix $A = (a_{jk})$ is called reducible if
there exist two nonempty sets $N, M \subset \{1, \ldots, n\}$ such that

$$N \cap M = \emptyset, \quad N \cup M = \{1, \ldots, n\},$$

and

$$a_{jk} = 0, \quad j \in N, \ k \in M.$$

Otherwise the matrix is called irreducible.

A reducible matrix $A$, after a reordering of the rows and columns, can
be partitioned into a $2 \times 2$ block matrix of the form

$$A = \begin{pmatrix} A_{11} & 0 \\ A_{21} & A_{22} \end{pmatrix}$$

(see Problem 4.5). Therefore, solving a linear system with the matrix $A$
can be reduced to solving two smaller linear systems with the matrices
$A_{11}$ and $A_{22}$.
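
Definition 4.6 can be tested algorithmically by reading the nonzero pattern of $A$ as a directed graph with an edge from $j$ to $k$ whenever $a_{jk} \neq 0$. A partition as in the definition exists exactly when some index cannot be reached from some other index, so $A$ is irreducible if and only if this graph is strongly connected. The Python sketch below uses this standard reformulation; the function names `reachable` and `is_irreducible` and the sample matrices are assumptions made for illustration.

```python
def reachable(A, start, transpose=False):
    """Indices reachable from `start` along nonzero entries of A (or of A^T)."""
    n = len(A)
    seen = {start}
    stack = [start]
    while stack:
        j = stack.pop()
        for k in range(n):
            entry = A[k][j] if transpose else A[j][k]
            if entry != 0 and k not in seen:
                seen.add(k)
                stack.append(k)
    return seen


def is_irreducible(A):
    """True if the directed graph of A is strongly connected."""
    n = len(A)
    forward = reachable(A, 0)
    backward = reachable(A, 0, transpose=True)
    return len(forward) == n and len(backward) == n


tridiagonal = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
block_triangular = [[1, 0], [1, 1]]      # a_12 = 0, so N = {1}, M = {2} works
```

Checking reachability from a single index in the graph and in its transpose suffices, because every index then lies on a cycle through that index.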

Theorem 4.7 Assume that the matrix $A = (a_{jk})$ is irreducible and weakly
row-diagonally dominant; i.e., $A$ is row-diagonally dominant,

$$\sum_{\substack{k=1 \\ k \neq j}}^{n} |a_{jk}| \leq |a_{jj}|, \quad j = 1, \ldots, n, \tag{4.6}$$

with strict inequality holding for at least one row $j$. Then the Jacobi
iterations converge for each $y \in \mathbb{C}^n$ and each
$x_0 \in \mathbb{C}^n$ to the unique solution of $Ax = y$ (in any norm on
$\mathbb{C}^n$).

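Proof. By (4.6) and Theorem 3.26 we have that $\|B\|_\infty \leq 1$ for the
Jacobi matrix $B = -D^{-1}(A_L + A_R)$. Therefore, from Theorem 3.32 it follows
that $\rho(B) \leq 1$ for the spectral radius.

Now assume that there exists an eigenvalue $\lambda$ of $B$ with
$|\lambda| = 1$. For the associated eigenvector we may assume that
$\|x\|_\infty = 1$. Then from $\lambda x = Bx$ we obtain the inequality

$$|\lambda|\, |x_j| \leq \sum_{\substack{k=1 \\ k \neq j}}^{n} \left| \frac{a_{jk}}{a_{jj}} \right| |x_k| \leq \sum_{\substack{k=1 \\ k \neq j}}^{n} \left| \frac{a_{jk}}{a_{jj}} \right| \leq 1, \quad j = 1, \ldots, n. \tag{4.7}$$

Let $N := \{j : |x_j| = 1\}$. Since $\|x\|_\infty = 1$, we have that
$N \neq \emptyset$. For $j \in N$ we have $|\lambda|\, |x_j| = 1$, and
therefore equality holds in (4.7); i.e.,

$$\sum_{\substack{k=1 \\ k \neq j}}^{n} \left| \frac{a_{jk}}{a_{jj}} \right| = 1, \quad j \in N.$$

From this it follows that

$$M := \{1, \ldots, n\} \setminus N \neq \emptyset,$$

since $A$ is weakly row-diagonally dominant, so that strict inequality holds
in (4.6) for at least one row. Because $A$ is irreducible, there exist
$j_0 \in N$ and $k_0 \in M$ such that $a_{j_0 k_0} \neq 0$. Now by using

$$|a_{j_0 k_0}|\, |x_{k_0}| < |a_{j_0 k_0}|$$

we obtain the contradiction

$$1 = |x_{j_0}| = |\lambda|\, |x_{j_0}| \leq \sum_{\substack{k=1 \\ k \neq j_0}}^{n} \left| \frac{a_{j_0 k}}{a_{j_0 j_0}} \right| |x_k| < \sum_{\substack{k=1 \\ k \neq j_0}}^{n} \left| \frac{a_{j_0 k}}{a_{j_0 j_0}} \right| \leq 1.$$

Therefore, we have $\rho(B) < 1$, and the statement of the theorem follows
from Theorem 4.1. □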



We leave it to the reader as an exercise to show that the matrix A 
from Example 4.5 is irreducible and weakly row-diagonally dominant (see 
Problem 4.6), implying convergence of the Jacobi iterations. 



4.2 Relaxation Methods 

From combining the a priori error estimate of Theorem 3.48 with Theorem 
4.1 we see that the spectral radius p(B) of the iteration matrix B may 
be considered as a measure for the speed of convergence of the successive 
approximations. Therefore, it is desirable to design the iterative scheme 
such that p(B) becomes small. This aim is the motivation of the relaxation 
methods to be discussed in this section. 







Each step of the Jacobi iterations can be written in the form

$$x_{\nu+1} = x_\nu + D^{-1}(y - A x_\nu),$$

indicating how the new approximation $x_{\nu+1}$ is obtained by correcting the
previous approximation $x_\nu$. The basic idea of the relaxation methods is
to multiply the correction term by some weight factor. Note that if the 
following relaxation iterations converge, then they converge to a solution 
of Ax = y. 

Definition 4.8 The iterative scheme

$$x_{\nu+1} := x_\nu + \omega D^{-1}(y - A x_\nu), \quad \nu = 0, 1, 2, \ldots,$$

i.e., in components

$$x_{\nu+1,j} = x_{\nu,j} + \frac{\omega}{a_{jj}} \left[ y_j - \sum_{k=1}^{n} a_{jk}\, x_{\nu,k} \right], \quad j = 1, \ldots, n,$$

is known as the Jacobi method with relaxation. The weight factor
$\omega > 0$ is called the relaxation parameter.

Theorem 4.9 Assume that the Jacobi matrix $B := -D^{-1}(A_L + A_R)$ has
real eigenvalues and spectral radius less than one. Then the spectral radius
of the iteration matrix

$$I - \omega D^{-1}A = (1 - \omega)I - \omega D^{-1}(A_L + A_R)$$

for the Jacobi method with relaxation becomes minimal for the relaxation
parameter

$$\omega_{\mathrm{opt}} = \frac{2}{2 - \lambda_{\max} - \lambda_{\min}},$$

and the corresponding spectral radius is

$$\rho(I - \omega_{\mathrm{opt}} D^{-1}A) = \frac{\lambda_{\max} - \lambda_{\min}}{2 - \lambda_{\max} - \lambda_{\min}},$$

where $\lambda_{\min}$ and $\lambda_{\max}$ denote the smallest and the
largest eigenvalue of $B$, respectively. In the case
$\lambda_{\min} \neq -\lambda_{\max}$ the convergence of the Jacobi method
with optimal relaxation parameter is faster than the convergence of the
Jacobi method without relaxation.



Proof. For $\omega > 0$ the equation $Bu = \lambda u$ is equivalent to

$$[(1 - \omega)I + \omega B]u = [1 - \omega + \omega\lambda]u.$$

Hence the eigenvalues $\lambda$ of $B$ correspond to the eigenvalues
$1 - \omega + \omega\lambda$ of $(1 - \omega)I + \omega B$. Therefore, the
eigenvalues of $(1 - \omega)I + \omega B$ are real, and the smallest
eigenvalue of $(1 - \omega)I + \omega B$ is given by
$1 - \omega + \omega\lambda_{\min}$ and the largest by
$1 - \omega + \omega\lambda_{\max}$. Obviously, the spectral radius becomes
minimal if the smallest and the largest eigenvalue are of opposite sign and
have the same absolute value, i.e., if

$$1 - \omega_{\mathrm{opt}} + \omega_{\mathrm{opt}}\lambda_{\min} = -1 + \omega_{\mathrm{opt}} - \omega_{\mathrm{opt}}\lambda_{\max}.$$

From this, elementary algebra yields the optimal parameter
$\omega_{\mathrm{opt}}$ and the spectral radius
$\rho(I - \omega_{\mathrm{opt}} D^{-1}A)$ as stated in the theorem. □
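
The formula for the optimal parameter can be checked against a brute-force scan. In the Python sketch below, the eigenvalues of the Jacobi matrix are hypothetical sample values and the function name `relaxed_radius` is an assumption made for the example; the code evaluates $\max_\lambda |1 - \omega + \omega\lambda|$ on a grid of relaxation parameters and compares the result with the closed form of Theorem 4.9.

```python
def relaxed_radius(omega, eigenvalues):
    """Spectral radius of (1 - omega) I + omega B for real eigenvalues of B."""
    return max(abs(1.0 - omega + omega * lam) for lam in eigenvalues)


# Hypothetical real spectrum of a Jacobi matrix with spectral radius 0.8.
eigenvalues = [-0.2, 0.3, 0.8]
lam_min, lam_max = min(eigenvalues), max(eigenvalues)

omega_opt = 2.0 / (2.0 - lam_max - lam_min)
rho_opt = (lam_max - lam_min) / (2.0 - lam_max - lam_min)

# Brute-force comparison on a grid of relaxation parameters in (0, 2).
grid = [0.01 * i for i in range(1, 200)]
best_on_grid = min(relaxed_radius(w, eigenvalues) for w in grid)
unrelaxed = relaxed_radius(1.0, eigenvalues)     # plain Jacobi, omega = 1
```

Since the sample spectrum is not symmetric about zero, the optimal relaxation gives a strictly smaller spectral radius than the unrelaxed Jacobi method, in line with the last statement of the theorem.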

For the Gauss-Seidel iterations, from $(D + A_L)x_{\nu+1} = -A_R x_\nu + y$ it
follows that

$$x_{\nu+1} = x_\nu + D^{-1}[y - A_L x_{\nu+1} - (D + A_R)x_\nu].$$

Hence, the corresponding relaxation method is defined as follows. Note
again that if the relaxation iterations converge, then they converge to a
solution of $Ax = y$.

Definition 4.10 The iterative scheme

$$x_{\nu+1} = x_\nu + \omega D^{-1}[y - A_L x_{\nu+1} - (D + A_R)x_\nu], \quad \nu = 0, 1, 2, \ldots,$$

i.e., in components

$$x_{\nu+1,j} = x_{\nu,j} + \frac{\omega}{a_{jj}} \left[ y_j - \sum_{k=1}^{j-1} a_{jk}\, x_{\nu+1,k} - \sum_{k=j}^{n} a_{jk}\, x_{\nu,k} \right], \quad j = 1, \ldots, n,$$

is known as the Gauss-Seidel method with relaxation or as the successive
overrelaxation (SOR) method with relaxation coefficient $\omega > 0$.

From

$$(D + \omega A_L)x_{\nu+1} = \omega y + [(1 - \omega)D - \omega A_R]x_\nu$$

we obtain that the iteration matrix of the SOR method is given by

$$B(\omega) := (D + \omega A_L)^{-1}[(1 - \omega)D - \omega A_R].$$

Here, as opposed to the relaxation of the Jacobi method, the iteration 
matrix depends nonlinearly on the relaxation parameter. This makes the 
convergence analysis of the SOR method more complicated. 
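
In practice the SOR iteration matrix is never formed; one sweeps through the components as in the Gauss-Seidel method and scales each correction by the relaxation parameter. The following Python sketch illustrates this under stated assumptions: the function name `sor`, the parameter value 1.2, and the 5 x 5 tridiagonal test system are choices made for the example.

```python
def sor(A, y, x0, omega, steps):
    """SOR sweeps in place; omega = 1 reproduces the Gauss-Seidel method."""
    n = len(A)
    x = list(x0)
    for _ in range(steps):
        for j in range(n):
            # x[0..j-1] are already updated, x[j..n-1] are still old values,
            # so this is y_j minus both sums of the component formula.
            residual = y[j] - sum(A[j][k] * x[k] for k in range(n))
            x[j] += omega * residual / A[j][j]
    return x


# Hypothetical test problem: stencil (-1, 2, -1) with solution (1, ..., 1).
n = 5
A = [
    [2.0 if j == k else (-1.0 if abs(j - k) == 1 else 0.0) for k in range(n)]
    for j in range(n)
]
y = [1.0, 0.0, 0.0, 0.0, 1.0]
x_sor = sor(A, y, [0.0] * n, 1.2, 200)
x_gs = sor(A, y, [0.0] * n, 1.0, 200)
```

Both runs approach the same solution; how the choice of the relaxation parameter affects the speed is the subject of the theorems that follow.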

Theorem 4.11 (Kahan) A necessary condition for the SOR method to
be convergent is that $0 < \omega < 2$.

Proof. Since the eigenvalues $\mu_1, \ldots, \mu_n$ of $B(\omega)$ are the
zeros of the characteristic polynomial, they satisfy

$$\prod_{j=1}^{n} \mu_j = \det B(\omega)$$

(where multiple eigenvalues are repeated according to their algebraic
multiplicity). From this, by the multiplication rules for determinants and
since $D + \omega A_L$ and $(1 - \omega)D - \omega A_R$ are triangular
matrices, it follows that

$$\prod_{j=1}^{n} \mu_j = \det(D + \omega A_L)^{-1} \det[(1 - \omega)D - \omega A_R] = (1 - \omega)^n.$$

This now implies

$$\rho[B(\omega)] \geq |1 - \omega|,$$

and from Theorem 4.1 we conclude the necessity of $0 < \omega < 2$ for
convergence. □

Theorem 4.12 (Ostrowski) If $A$ is Hermitian and positive definite, then
the SOR method converges for all $x_0 \in \mathbb{C}^n$, all
$y \in \mathbb{C}^n$, and all $0 < \omega < 2$ to the unique solution of
$Ax = y$.

Proof. Let $\mu$ be an eigenvalue of $B(\omega)$ with eigenvector $x$; i.e.,

$$[(1 - \omega)D - \omega A_R]x = \mu (D + \omega A_L)x.$$

With the aid of

$$(2 - \omega)D - \omega A - \omega(A_R - A_L) = 2[(1 - \omega)D - \omega A_R]$$

and

$$(2 - \omega)D + \omega A - \omega(A_R - A_L) = 2[D + \omega A_L]$$

we deduce that

$$[(2 - \omega)D - \omega A - \omega(A_R - A_L)]x = \mu[(2 - \omega)D + \omega A - \omega(A_R - A_L)]x.$$

Taking the Euclidean scalar product with $x$, it now follows that

$$\mu = \frac{(2 - \omega)d - \omega a + i\omega s}{(2 - \omega)d + \omega a + i\omega s},$$

where we have set

$$a := (Ax, x), \quad d := (Dx, x), \quad s := i(A_R x - A_L x, x).$$

Since $A$ is positive definite, we have $a > 0$ and $d > 0$, and since $A$ is
Hermitian, $s$ is real. From

$$|(2 - \omega)d - \omega a| < |(2 - \omega)d + \omega a|$$

for $0 < \omega < 2$ we now can conclude that $|\mu| < 1$ for
$0 < \omega < 2$. Hence convergence of the SOR method for $0 < \omega < 2$
follows from Theorem 4.1. □
The calculation of the optimal relaxation parameter, i.e., the parameter
minimizing the spectral radius, is difficult except in some simple cases.
Usually it is obtained only approximately by trial and error, based on trying
several values of $\omega$ and observing the effect on the speed of
convergence. However, the effort is worthwhile, since the resulting
improvement of the convergence can be considerable, as we will indicate by the
following analysis, which relates the convergence of the SOR method to
that of the Jacobi method for a certain class of matrices that occurs in the
discretization of boundary value problems.

Definition 4.13 A matrix $A = D + A_L + A_R$ with nonsingular diagonal
$D$ is called consistently ordered if the eigenvalues of

$$C(\alpha) := -\alpha D^{-1}A_L - \frac{1}{\alpha} D^{-1}A_R, \quad \alpha \in \mathbb{C} \setminus \{0\},$$

do not depend on $\alpha$.

The following theorem ensures that the analysis we are going to develop 
applies to the matrix of Example 2.1, i.e., of Example 4.5. 

Remark 4.14 Tridiagonal matrices with nonzero diagonal elements are 
consistently ordered. 

Proof. After introducing the diagonal matrix

$$S(\alpha) := \mathrm{diag}(1, \alpha, \alpha^2, \ldots, \alpha^{n-1})$$

for tridiagonal matrices $A = D + A_L + A_R$, we have that

$$S(\alpha)\, C(1)\, S(\alpha)^{-1} = C(\alpha);$$

i.e., all matrices $C(\alpha)$ are similar, and therefore they have the same
eigenvalues. □

Without going into detail, we wish to say that a much wider class of 
matrices arising in the discretization of differential equations enjoys the 
property of being consistently ordered in the sense of Definition 4.13. For 
a more comprehensive study we refer to [61, 63, 66]. 

Theorem 4.15 (Young) Assume that $A$ is a consistently ordered matrix
and that the eigenvalues of the Jacobi matrix $-D^{-1}(A_L + A_R)$ are real
with spectral radius $\Lambda = \rho[-D^{-1}(A_L + A_R)] < 1$. Then the SOR
method converges for all $0 < \omega < 2$. The spectral radius of the SOR
matrix $B(\omega)$ is minimal for

$$\omega_{\mathrm{opt}} = \frac{2}{1 + \sqrt{1 - \Lambda^2}}.$$

In this case we have

$$\rho[B(\omega_{\mathrm{opt}})] = \frac{1 - \sqrt{1 - \Lambda^2}}{1 + \sqrt{1 - \Lambda^2}}.$$
Proof. From

$$(I + \omega D^{-1}A_L)[\mu I - B(\omega)] = \mu(I + \omega D^{-1}A_L) - D^{-1}[(1 - \omega)D - \omega A_R]$$

$$= (\mu + \omega - 1)I + \sqrt{\mu}\,\omega \left( \sqrt{\mu}\, D^{-1}A_L + \frac{1}{\sqrt{\mu}}\, D^{-1}A_R \right)$$

and the fact that $I + \omega D^{-1}A_L$ is nonsingular it can be seen that
$\mu \neq 0$ is an eigenvalue of $B(\omega)$ if and only if

$$\lambda = \frac{\mu + \omega - 1}{\sqrt{\mu}\,\omega} \tag{4.8}$$

is an eigenvalue of

$$-\sqrt{\mu}\, D^{-1}A_L - \frac{1}{\sqrt{\mu}}\, D^{-1}A_R.$$

Since $A$ is assumed to be consistently ordered, it follows that
$\mu \neq 0$ is an eigenvalue of $B(\omega)$ if and only if $\lambda$ is an
eigenvalue of $-D^{-1}(A_L + A_R)$. Solving the quadratic equation

$$\mu + \omega - 1 = \sqrt{\mu}\,\omega\lambda$$

yields

$$\mu = \left( \frac{\omega\lambda}{2} \pm \sqrt{\frac{\omega^2\lambda^2}{4} + 1 - \omega} \right)^2.$$



Setting $\alpha = -1$ in Definition 4.13, it is obvious that if $\lambda$ is
an eigenvalue of $-D^{-1}(A_L + A_R)$, then $-\lambda$ also is an eigenvalue
of $-D^{-1}(A_L + A_R)$. Therefore, since we are interested only in the
spectral radius of $B(\omega)$, we can confine our considerations to

$$\mu = \left( \frac{\omega\lambda}{2} + \sqrt{\frac{\omega^2\lambda^2}{4} + 1 - \omega} \right)^2.$$

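Because of $|\lambda| < 1$, the quadratic equation

$$\omega^2\lambda^2 - 4\omega + 4 = 0$$

has two real solutions, and only one of them belongs to the interval $(0, 2)$,
namely

$$\omega_0(\lambda) = \frac{2}{1 + \sqrt{1 - \lambda^2}} \geq 1.$$

This implies that

$$\omega^2\lambda^2 - 4\omega + 4 \geq 0, \quad 0 < \omega \leq \omega_0(\lambda).$$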
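Therefore, we have

$$|\mu(\omega)| = \left( \frac{\omega\lambda}{2} + \sqrt{\frac{\omega^2\lambda^2}{4} + 1 - \omega} \right)^2, \quad 0 < \omega \leq \omega_0(\lambda). \tag{4.9}$$

For $\omega_0(\lambda) < \omega < 2$ the eigenvalues are complex, with

$$|\mu(\omega)| = \omega - 1, \quad \omega_0(\lambda) < \omega < 2. \tag{4.10}$$

From the expressions (4.9) and (4.10) it can be seen that $|\mu(\omega)|$ is
monotonically nondecreasing with respect to $|\lambda|$. Hence, with the
spectral radius $\Lambda$ of the Jacobi matrix,

$$\rho[B(\omega)] = \begin{cases} \left( \dfrac{\omega\Lambda}{2} + \sqrt{\dfrac{\omega^2\Lambda^2}{4} + 1 - \omega} \right)^2, & 0 < \omega \leq \omega_0(\Lambda), \\[2ex] \omega - 1, & \omega_0(\Lambda) \leq \omega < 2. \end{cases} \tag{4.11}$$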



The function

$$f(\omega) := \frac{\omega\Lambda}{2} + \sqrt{\frac{\omega^2\Lambda^2}{4} + 1 - \omega}$$

has the properties $f(0) = 1$ and

$$f'(\omega) = \frac{\Lambda}{2} + \frac{\omega\Lambda^2 - 2}{2\sqrt{\omega^2\Lambda^2 + 4 - 4\omega}} < 0.$$

The latter follows from

$$\Lambda^2(4 - 4\omega + \omega^2\Lambda^2) < 4 - 4\omega\Lambda^2 + \omega^2\Lambda^4 = (2 - \omega\Lambda^2)^2.$$

Therefore, the spectral radius described by (4.11) is strictly monotonically
decreasing for $0 < \omega \leq \omega_0$ and strictly monotonically
increasing for $\omega_0 \leq \omega < 2$ (see Figure 4.1). Since
$\rho[B(0)] = \rho[B(2)] = 1$, we finally obtain that $\rho[B(\omega)] < 1$
for all $0 < \omega < 2$ and that $\rho[B(\omega)]$ assumes its minimum for
$\omega = \omega_0(\Lambda)$ with value
$\rho[B(\omega_0(\Lambda))] = \omega_0(\Lambda) - 1$. □
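
The piecewise description (4.11) of the SOR spectral radius can be explored numerically. The Python sketch below is an illustration only: the function names `omega_zero` and `sor_radius` and the sample value 0.9 for the spectral radius of the Jacobi matrix are assumptions. It evaluates (4.11) on a grid and locates the minimizing relaxation parameter.

```python
import math


def omega_zero(lam):
    """The solution of omega^2 lam^2 - 4 omega + 4 = 0 lying in (0, 2)."""
    return 2.0 / (1.0 + math.sqrt(1.0 - lam * lam))


def sor_radius(omega, lam):
    """Spectral radius of the SOR matrix according to (4.11)."""
    if omega <= omega_zero(lam):
        # Clamp tiny negative values caused by rounding at omega = omega_0.
        disc = max(omega * omega * lam * lam / 4.0 + 1.0 - omega, 0.0)
        return (omega * lam / 2.0 + math.sqrt(disc)) ** 2
    return omega - 1.0


lam = 0.9                                  # hypothetical Jacobi spectral radius
w_opt = omega_zero(lam)
grid = [0.001 * i for i in range(1, 2000)]
w_best = min(grid, key=lambda w: sor_radius(w, lam))
gauss_seidel_radius = sor_radius(1.0, lam)  # omega = 1 gives lam squared
```

The grid minimum lands next to $\omega_0$, and the steep descent of the curve just to the left of the minimum suggests that overestimating the optimal parameter slightly costs less than underestimating it.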

Corollary 4.16 Under the assumptions of Theorem 4.15 the Gauss-Seidel
method converges twice as fast as the Jacobi method.

Proof. From (4.8) we observe that $\mu = \lambda^2$ for $\omega = 1$; i.e.,
we have

$$\rho[B(1)] = \{\rho[-D^{-1}(A_L + A_R)]\}^2$$

for the spectral radii of the Gauss-Seidel matrix $B(1)$ and the Jacobi
matrix $-D^{-1}(A_L + A_R)$. Now the statement follows from the observation
that by the a priori estimate of Theorem 3.48 the number $N$ of iterations
required for a desired accuracy is inversely proportional to the modulus of
the logarithm of the spectral radius; i.e.,

$$\frac{N(\text{Gauss-Seidel})}{N(\text{Jacobi})} \approx \frac{\ln \rho[-D^{-1}(A_L + A_R)]}{\ln \rho[B(1)]} = \frac{1}{2},$$

and this proves the assertion. □
FIGURE 4.1. Spectral radius $\rho[B(\omega)]$ for SOR

Example 4.17 For the tridiagonal matrix $A$ from Example 4.5 we have

$$\frac{N(\text{SOR})}{N(\text{Jacobi})} \approx \frac{\pi}{4(n+1)}$$

for the optimal relaxation parameter.

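Proof. Using the trigonometric addition theorem

$$\frac{1}{2} \sin \frac{\pi j(k-1)}{n+1} + \frac{1}{2} \sin \frac{\pi j(k+1)}{n+1} = \cos \frac{\pi j}{n+1}\, \sin \frac{\pi j k}{n+1},$$

it can be seen that the Jacobi matrix

$$-D^{-1}(A_L + A_R) = \frac{1}{2} \begin{pmatrix} 0 & 1 & & & \\ 1 & 0 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & 1 & 0 & 1 \\ & & & 1 & 0 \end{pmatrix}$$

corresponding to Example 4.5 has the eigenvalues

$$\lambda_j = \cos \frac{\pi j}{n+1}, \quad j = 1, \ldots, n,$$

and associated eigenvectors $v_j$ with components

$$v_{j,k} = \sin \frac{\pi j k}{n+1}, \quad k = 1, \ldots, n, \quad j = 1, \ldots, n.$$

Hence,

$$\Lambda = \rho[-D^{-1}(A_L + A_R)] = \cos \frac{\pi}{n+1} \approx 1 - \frac{\pi^2}{2(n+1)^2}$$

and

$$-\ln \rho[-D^{-1}(A_L + A_R)] \approx \frac{\pi^2}{2(n+1)^2}.$$

From Theorem 4.15 we obtain

$$\omega_{\mathrm{opt}} = \frac{2}{1 + \sin \dfrac{\pi}{n+1}}$$

and

$$\rho[B(\omega_{\mathrm{opt}})] = \frac{1 - \sin \dfrac{\pi}{n+1}}{1 + \sin \dfrac{\pi}{n+1}} \approx 1 - \frac{2\pi}{n+1},$$

whence

$$-\ln \rho[B(\omega_{\mathrm{opt}})] \approx \frac{2\pi}{n+1}$$

follows. This concludes the proof. □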

For example, for $n = 30$ the optimal SOR method is about forty times
as fast as the Jacobi method. Note that the gain in the speed of
convergence grows as $n$ increases. The fact that in Example 4.17,
and, more generally, in almost all linear systems arising in the
discretization of boundary value problems, the optimal relaxation parameter
has the property $\omega > 1$ explains why the method is known as the
overrelaxation method.
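
The factor of about forty for $n = 30$ can be reproduced from the exact spectral radii derived in the proof of Example 4.17. The following Python sketch is an illustration; the function name `iteration_ratio` is an assumption made for the example.

```python
import math


def iteration_ratio(n):
    """Ratio N(SOR)/N(Jacobi) for the model matrix from exact spectral radii."""
    theta = math.pi / (n + 1)
    rho_jacobi = math.cos(theta)
    rho_sor = (1.0 - math.sin(theta)) / (1.0 + math.sin(theta))
    return math.log(rho_jacobi) / math.log(rho_sor)


n = 30
exact = iteration_ratio(n)
asymptotic = math.pi / (4 * (n + 1))     # the estimate of Example 4.17
speedup = 1.0 / exact                    # roughly forty for n = 30
```

The exact ratio and the asymptotic estimate already agree closely for this moderate value of $n$.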



4.3 Two-Grid Methods 

Consider the linear system 

Ax = y (4.12) 

with a nonsingular matrix $A$, and assume that we already have an approximate
solution $x_0$ available with a residual, or defect,

$$r_0 := y - A x_0,$$

for which, in general, $r_0 \neq 0$. Then we try to improve on the accuracy by
writing

$$x_1 = x_0 + \delta_0 \tag{4.13}$$

with some correction term $\delta_0$. Substituting this into (4.12) we obtain
that $\delta_0$ has to satisfy the defect correction equation

$$A\delta_0 = r_0$$
in order that $x_1$ satisfy (4.12). We observe that the correction term will,
in general, be small compared to $x_0$, and therefore it is unnecessary to
solve the defect correction equation exactly. Hence we write

$$\delta_0 = A^{-1}_{\mathrm{approx}}\, r_0,$$

where $A^{-1}_{\mathrm{approx}}$ is some approximation for the inverse
$A^{-1}$ of $A$. Substituting this into (4.13) we obtain

$$x_1 := x_0 + A^{-1}_{\mathrm{approx}}[y - A x_0] = (I - A^{-1}_{\mathrm{approx}} A)x_0 + A^{-1}_{\mathrm{approx}}\, y \tag{4.14}$$

as our new approximate solution to (4.12). This procedure is known as the
defect correction principle.

Repeating this process yields the defect correction iteration defined by

$$x_{\nu+1} := x_\nu + A^{-1}_{\mathrm{approx}}[y - A x_\nu], \quad \nu = 0, 1, 2, \ldots, \tag{4.15}$$

for the solution of (4.12). By Theorem 4.1, the iteration (4.15) converges
to the unique solution $x$ of $A^{-1}_{\mathrm{approx}}[y - Ax] = 0$, provided
that the spectral radius of the iteration matrix
$I - A^{-1}_{\mathrm{approx}} A$ is less than one. Since the unique solution
$x$ of $Ax = y$ trivially satisfies $A^{-1}_{\mathrm{approx}}[y - Ax] = 0$, we
then have convergence of the scheme (4.15) to the unique solution of (4.12).
For a rapid convergence it is desirable that the spectral radius be close to
zero, which will be the case if $A^{-1}_{\mathrm{approx}}$ is a reasonable
approximation to $A^{-1}$. For a more complete introduction to the defect
correction principle we refer to [56].

Here we wish to indicate briefly two applications. Firstly, the defect
correction principle (4.14) can be used to improve on the accuracy of an
approximate solution $x_0$, obtained for example by Gaussian elimination.
Then, in principle, the computation of $x_0$ corresponds to some
approximation $x_0 = A^{-1}_{\mathrm{approx}}\, y$ obtained from an LR
decomposition. This means that evaluating
$\delta_0 = A^{-1}_{\mathrm{approx}}\, r_0$ is achieved by applying again the
same elimination algorithm to the defect correction equation. This way, the
defect correction principle provides a simple tool to improve on the accuracy
of a solution to a linear system obtained by elimination.
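
The defect correction iteration (4.15) can be sketched in Python as follows. This is an illustration under stated assumptions: the function names `matvec` and `defect_correction` are chosen here, and the approximate inverse is simply a hand-perturbed copy of the exact inverse of a 2 x 2 sample matrix, standing in for the inexact solver that elimination in finite precision would provide.

```python
def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(M[j][k] * v[k] for k in range(len(v))) for j in range(len(M))]


def defect_correction(A, A_approx_inv, y, x0, steps):
    """Iterate x_{v+1} = x_v + A_approx_inv (y - A x_v)."""
    x = list(x0)
    for _ in range(steps):
        defect = [yj - axj for yj, axj in zip(y, matvec(A, x))]
        correction = matvec(A_approx_inv, defect)
        x = [xj + cj for xj, cj in zip(x, correction)]
    return x


# Hypothetical data: the exact inverse of A is [[0.3, -0.1], [-0.2, 0.4]].
A = [[4.0, 1.0], [2.0, 3.0]]
A_approx_inv = [[0.31, -0.09], [-0.21, 0.38]]
y = [6.0, 8.0]                                  # exact solution is (1, 2)
x = defect_correction(A, A_approx_inv, y, [0.0, 0.0], 40)
```

For this perturbation the iteration matrix $I - A^{-1}_{\mathrm{approx}} A$ has row-sum norm 0.15, so every step reduces the error by at least that factor.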

Secondly, we would like to illustrate the more systematic use of the defect 
correction principle for the development of multigrid methods as a powerful 
tool for the fast iterative solution of linear systems arising in the discretiza- 
tion of differential and integral equations. For the sake of simplicity we will 
confine ourselves to the case of two-grid iterations. 

The basic idea of two-grid methods is to use the defect correction principle
with the approximate inverse $A^{-1}_{\mathrm{approx}}$ for the matrix
$A_{\mathrm{fine}}$ of a large linear system corresponding to a fine
approximation grid given simply by the exact inverse of the matrix
$A_{\mathrm{coarse}}$ of a smaller linear system, corresponding to a coarse
approximation grid. Of course, a number of mathematical problems arise in the
design of such methods concerning the appropriate relation between the fine
and coarse grid and the transfer between the two grids. We will outline some
ideas on the structure of two-grid methods by again considering the simple
model problem from Example 2.1 as a typical case.

Recall that the solution vector $U^{(h)} \in \mathbb{R}^n$ of the linear system

$$A^{(h)} U^{(h)} = F^{(h)} \tag{4.16}$$

with the $n \times n$ tridiagonal matrix

$$A^{(h)} = \frac{1}{h^2} \begin{pmatrix} 2 & -1 & & & & \\ -1 & 2 & -1 & & & \\ & -1 & 2 & -1 & & \\ & & \ddots & \ddots & \ddots & \\ & & & -1 & 2 & -1 \\ & & & & -1 & 2 \end{pmatrix}$$

corresponds to approximate values $U^{(h)}_j \approx u(jh)$,
$j = 1, \ldots, n$, for the solution $u$ of the boundary value problem
(2.1)-(2.2) at the internal grid points. Since we want to make use of two
different grids in our analysis, we indicate the dependence on the mesh width

$$h = \frac{1}{n+1}$$

in the matrix $A^{(h)}$ and the solution $U^{(h)}$. We assume that $n$ is odd
because later we want to choose the coarser grid by doubling the mesh width.

We start from the Jacobi iteration with relaxation

$$U^{(h)}_{\nu+1} = U^{(h)}_\nu + \omega [D^{(h)}]^{-1} [F^{(h)} - A^{(h)} U^{(h)}_\nu], \quad \nu = 0, 1, 2, \ldots, \tag{4.17}$$

as introduced in Definition 4.8. From our analysis in Example 4.17 we
deduce that $A^{(h)}$ has the $n$ eigenvalues

$$\mu_j = \frac{4}{h^2} \sin^2 \frac{\pi j h}{2}, \quad j = 1, \ldots, n, \tag{4.18}$$

and associated eigenvectors $v^{(h)}_j$ with components

$$v^{(h)}_{j,k} = \sin(\pi j k h), \quad k = 1, \ldots, n, \quad j = 1, \ldots, n. \tag{4.19}$$

Note that by Theorem 3.29, the eigenvectors of the Hermitian matrix $A^{(h)}$
form an orthogonal basis for $\mathbb{R}^n$ (see Problem 4.18). The
$v^{(h)}_j$, $j = 1, \ldots, n$, are also eigenvectors of the Jacobi matrix
$I - [D^{(h)}]^{-1} A^{(h)}$, with eigenvalues

$$\lambda_j = \cos(\pi j h), \quad j = 1, \ldots, n.$$

From Theorem 4.9 we observe that $\omega = 1$ is the optimal choice for the
Jacobi iteration with relaxation. However, it will turn out that in the
context of two-grid methods the damped, or underrelaxed, Jacobi method with
$0 < \omega < 1$ is more important. This is due to the following observation.
Since the $v_j^{(h)}$, $j = 1, \ldots, n$, provide a basis for $\mathbb{R}^n$,
we can represent the difference between the exact solution $U^{(h)}$ and the
$\nu$th iterate $U_\nu^{(h)}$ in the form

$$U^{(h)} - U_\nu^{(h)} = \sum_{j=1}^{n} \alpha_{j,\nu}\, v_j^{(h)}.$$

From the fact that

$$\bigl\{I - \omega\,[D^{(h)}]^{-1} A^{(h)}\bigr\} v_j^{(h)} = \Bigl\{1 - 2\omega \sin^2 \frac{j\pi h}{2}\Bigr\} v_j^{(h)}, \quad j = 1, \ldots, n,$$

we derive the recurrence relation

$$\alpha_{j,\nu+1} = \Bigl\{1 - 2\omega \sin^2 \frac{j\pi h}{2}\Bigr\} \alpha_{j,\nu}, \quad j = 1, \ldots, n,$$

for the coefficients $\alpha_{j,\nu}$. In particular, if we choose
$\omega = 0.5$, we have that

$$\alpha_{j,\nu+1} = \cos^2 \frac{j\pi h}{2}\; \alpha_{j,\nu}, \quad j = 1, \ldots, n. \qquad (4.20)$$

From this we observe that even though convergence of the iterations (4.17)
becomes slower when we decrease $\omega$, for $\omega = 0.5$ the convergence
restricted to the subspace

$$W_n := \operatorname{span}\bigl\{v_{(n+1)/2}^{(h)}, \ldots, v_n^{(h)}\bigr\}$$

of high frequencies is dramatically accelerated, since in this case from (4.20)
we have that

$$|\alpha_{j,\nu+1}| \le \frac{1}{2}\, |\alpha_{j,\nu}|, \quad j = \frac{n+1}{2}, \ldots, n.$$

This fact can be expressed by saying that the damped Jacobi iteration is a
smoothing iteration. In the sequel we will consider only the damping factor
$\omega = 0.5$.
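The smoothing property can be checked numerically. The following minimal sketch (the function names and the choice n = 7 are illustrative assumptions, not part of the text) evaluates the damping factors from (4.20) and applies one damped Jacobi step to an error vector for the model matrix, using that the diagonal part is $(2/h^2) I$.

```python
import math


def damping_factors(n, omega):
    """Return the factors 1 - 2*omega*sin^2(j*pi*h/2), j = 1..n, with h = 1/(n+1)."""
    h = 1.0 / (n + 1)
    return [1.0 - 2.0 * omega * math.sin(j * math.pi * h / 2.0) ** 2
            for j in range(1, n + 1)]


def damped_jacobi_error_step(e, omega):
    """Apply I - omega * D^{-1} A to an error vector e, A = tridiag(-1, 2, -1) / h^2.

    Because D = (2 / h^2) I, the mesh width cancels and the step reads
    e_k - (omega / 2) * (2 e_k - e_{k-1} - e_{k+1}) with zero boundary values.
    """
    n = len(e)
    result = []
    for k in range(n):
        left = e[k - 1] if k > 0 else 0.0
        right = e[k + 1] if k < n - 1 else 0.0
        result.append(e[k] - 0.5 * omega * (2.0 * e[k] - left - right))
    return result


n = 7
factors = damping_factors(n, 0.5)
for j, factor in enumerate(factors, start=1):
    kind = "high" if j >= (n + 1) // 2 else "low"
    print(f"j = {j} ({kind} frequency): factor = {factor:.4f}")
```

The low-frequency factors stay close to one, while every factor with index at least (n + 1)/2 is at most one-half, which is the behavior the coarse-grid correction is designed to complement.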

The slow convergence with respect to low frequencies will now be taken
care of by the defect correction principle through incorporating a so-called
coarse grid correction on the grid with mesh width 2h. For this we need
to transfer vectors corresponding to the fine grid to vectors corresponding
to the coarse grid and vice versa. The transfer from the fine grid
to the coarse grid requires a restriction and corresponds to a mapping
$R^{(h)} : \mathbb{R}^n \to \mathbb{R}^{(n-1)/2}$. Note that we only need to
consider this mapping for the interior grid points. Instead of choosing the
restriction $(R^{(h)} y)_k = y_{2k}$, $k = 1, \ldots, \frac{n-1}{2}$, for
$y \in \mathbb{R}^n$ it turns out to be advantageous to also incorporate
information contained in the odd nodal points of the fine grid by
using the restriction

$$(R^{(h)} y)_k = \frac{1}{4}\,[\,y_{2k-1} + 2 y_{2k} + y_{2k+1}\,], \quad k = 1, \ldots, \frac{n-1}{2},$$

as illustrated in Figure 4.2.

FIGURE 4.2. Restriction operator of the two-grid method for n = 7



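The corresponding matrix is

$$R^{(h)} = \frac{1}{4}
\begin{pmatrix}
1 & 2 & 1 &   &   &        &   &   &   \\
  &   & 1 & 2 & 1 &        &   &   &   \\
  &   &   &   &   & \ddots &   &   &   \\
  &   &   &   &   &        & 1 & 2 & 1
\end{pmatrix}.$$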



With the aid of elementary trigonometric manipulations one can establish
the relation

$$R^{(h)} v_j^{(h)} = c_j^2\, v_j^{(2h)}, \qquad R^{(h)} v_{n+1-j}^{(h)} = -s_j^2\, v_j^{(2h)}, \quad j = 1, \ldots, \frac{n-1}{2}, \qquad (4.21)$$

between the eigenvectors (4.19) for the fine and the coarse grid (see Problem
4.19). Here we have set

$$c_j = \cos \frac{j\pi h}{2}, \qquad s_j = \sin \frac{j\pi h}{2}, \quad j = 1, \ldots, \frac{n-1}{2}.$$

The transfer from the coarse grid to the fine grid is called prolongation
and corresponds to a mapping $P^{(h)} : \mathbb{R}^{(n-1)/2} \to \mathbb{R}^n$.
The simplest choice for $P^{(h)}$ is given by the piecewise linear
interpolation (see Chapter 8)

$$(P^{(h)} y)_{2k} = y_k, \quad k = 1, \ldots, \frac{n-1}{2},$$

$$(P^{(h)} y)_{2k-1} = \frac{1}{2}\,[\,y_k + y_{k-1}\,], \quad k = 1, \ldots, \frac{n+1}{2},$$

for $y \in \mathbb{R}^{(n-1)/2}$ (with $y_0 = y_{(n+1)/2} = 0$ at the two
boundary points), as illustrated in Figure 4.3. The corresponding matrix is
given by $P^{(h)} = 2\,[R^{(h)}]^T$. Either by direct computation or from (4.21)
and the fact that the matrices $P^{(h)}$ and $2R^{(h)}$ are adjoint one can
establish that (see Problem 4.19)

$$P^{(h)} v_j^{(2h)} = c_j^2\, v_j^{(h)} - s_j^2\, v_{n+1-j}^{(h)}, \quad j = 1, \ldots, \frac{n-1}{2}. \qquad (4.22)$$



FIGURE 4.3. Prolongation operator of the two-grid method for n = 7 
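The two grid transfer operators are easy to check numerically. The sketch below is an illustration only: the function names are hypothetical, vectors are stored with zero-based indices, and zero boundary values are assumed for the interpolation. It implements the weighted restriction and the piecewise linear prolongation and prints the defect in the restriction relations for the eigenvectors when n = 7.

```python
import math


def restrict(y):
    """Weighted restriction (R y)_k = (y_{2k-1} + 2 y_{2k} + y_{2k+1}) / 4."""
    m = (len(y) - 1) // 2
    # the one-based fine index i is stored at position i - 1
    return [(y[2 * k - 2] + 2.0 * y[2 * k - 1] + y[2 * k]) / 4.0
            for k in range(1, m + 1)]


def prolong(z):
    """Piecewise linear interpolation from the coarse grid to the fine grid."""
    m = len(z)
    padded = [0.0] + list(z) + [0.0]       # padded[k] = z_k, zero boundary values
    y = [0.0] * (2 * m + 1)
    for k in range(1, m + 1):
        y[2 * k - 1] = padded[k]           # fine node 2k coincides with coarse node k
    for k in range(1, m + 2):
        y[2 * k - 2] = 0.5 * (padded[k] + padded[k - 1])   # fine node 2k - 1
    return y


def eigenvector(j, n):
    """Components sin(pi*j*k*h), k = 1..n, for the grid with h = 1/(n+1)."""
    h = 1.0 / (n + 1)
    return [math.sin(math.pi * j * k * h) for k in range(1, n + 1)]


n = 7
h = 1.0 / (n + 1)
for j in range(1, (n - 1) // 2 + 1):
    c2 = math.cos(j * math.pi * h / 2.0) ** 2
    coarse = eigenvector(j, (n - 1) // 2)  # same mode on the grid with mesh width 2h
    restricted = restrict(eigenvector(j, n))
    defect = max(abs(r - c2 * v) for r, v in zip(restricted, coarse))
    print(f"j = {j}: max |R v_j - c_j^2 v_j^(2h)| = {defect:.2e}")
```

Storing the operators as functions rather than matrices mirrors how they are typically applied in practice, since each output component depends on at most three input components.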



Now we are in a position to use the n x n matrix
$P^{(h)} [A^{(2h)}]^{-1} R^{(h)}$ as the coarse-grid correction. Computing
$P^{(h)} [A^{(2h)}]^{-1} R^{(h)} y$ corresponds to first restricting the vector
$y \in \mathbb{R}^n$ to $R^{(h)} y \in \mathbb{R}^{(n-1)/2}$, then solving the
$\frac{n-1}{2} \times \frac{n-1}{2}$ system $A^{(2h)} z = R^{(h)} y$ by an
elimination method, and finally prolonging the solution
$z \in \mathbb{R}^{(n-1)/2}$ to $P^{(h)} z \in \mathbb{R}^n$. Combining this
coarse-grid correction with N steps of the damped Jacobi iteration in the
sense of (4.14) now yields one step of the two-grid iteration scheme

$$U_{\nu+1} = J_N(U_\nu, F^{(h)}) - P^{(h)} [A^{(2h)}]^{-1} R^{(h)} \bigl[A^{(h)} J_N(U_\nu, F^{(h)}) - F^{(h)}\bigr],$$

where $J_N(U_\nu, F^{(h)})$ denotes the result of N steps of the damped Jacobi
iterations (4.17) with starting element $U_\nu$. Obviously, the iteration matrix
corresponding to this two-grid method is given by

$$T_N = \bigl\{I - P^{(h)} [A^{(2h)}]^{-1} R^{(h)} A^{(h)}\bigr\} \Bigl\{I - \frac{1}{2}\,[D^{(h)}]^{-1} A^{(h)}\Bigr\}^N. \qquad (4.23)$$

For an investigation of the convergence for our two-grid iteration scheme
we need to determine the spectral radius of $T_N$. For simplicity we confine
ourselves to the case where N = 1; i.e., one step of the damped Jacobi
iteration on the fine grid alternates with a coarse-grid correction by
elimination on the coarse grid. We set $T_1 = T$.

Theorem 4.18 For the spectral radius of T we have that $\rho(T) = 0.5$; i.e.,
the two-grid iterations converge.
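
Proof. We note that from (4.18) and (4.19), with h replaced by 2h, we have
that

$$A^{(2h)} v_j^{(2h)} = \frac{1}{h^2} \sin^2(\pi j h)\, v_j^{(2h)} = \frac{4}{h^2}\, c_j^2 s_j^2\, v_j^{(2h)},$$

whence

$$[A^{(2h)}]^{-1} v_j^{(2h)} = \frac{h^2}{4 c_j^2 s_j^2}\, v_j^{(2h)}, \quad j = 1, \ldots, \frac{n-1}{2},$$

follows. From this, using (4.20)-(4.22) and $R^{(h)} v_{(n+1)/2}^{(h)} = 0$, it
can be derived that

$$\begin{pmatrix} T v_j^{(h)} \\ T v_{n+1-j}^{(h)} \end{pmatrix}
= s_j^2 c_j^2 \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}
\begin{pmatrix} v_j^{(h)} \\ v_{n+1-j}^{(h)} \end{pmatrix} \qquad (4.24)$$

for $j = 1, \ldots, \frac{n-1}{2}$ and

$$T v_{(n+1)/2}^{(h)} = \frac{1}{2}\, v_{(n+1)/2}^{(h)}. \qquad (4.25)$$

Since the matrix

$$Q = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$$

has the eigenvalues 0 and 2, from (4.24) and (4.25) it can be seen that the
matrix T has the eigenvalues

$$2 s_j^2 c_j^2 = \frac{1}{2} \sin^2(\pi j h), \quad j = 1, \ldots, \frac{n+1}{2},$$

and the eigenvalue zero of multiplicity $\frac{n-1}{2}$. This implies the
assertion on the spectral radius of T.

□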
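As a numerical illustration of Theorem 4.18, the following sketch runs the two-grid iteration with N = 1 for the model matrix. Everything here is an assumption made for the example: the helper names, the choice n = 15, the manufactured exact solution, and the use of the Thomas algorithm as the elimination method on the coarse grid.

```python
import math


def apply_A(u, h):
    """Multiply by A = tridiag(-1, 2, -1) / h^2."""
    n = len(u)
    return [(2.0 * u[k] - (u[k - 1] if k > 0 else 0.0)
             - (u[k + 1] if k < n - 1 else 0.0)) / h ** 2 for k in range(n)]


def solve_A(f, h):
    """Solve tridiag(-1, 2, -1) / h^2 * z = f by elimination (Thomas algorithm)."""
    n = len(f)
    diag = [2.0] * n
    rhs = [h ** 2 * value for value in f]
    for k in range(1, n):
        factor = -1.0 / diag[k - 1]        # multiplier removing the subdiagonal -1
        diag[k] += factor                  # 2 - 1 / diag[k - 1]
        rhs[k] -= factor * rhs[k - 1]
    z = [0.0] * n
    z[-1] = rhs[-1] / diag[-1]
    for k in range(n - 2, -1, -1):
        z[k] = (rhs[k] + z[k + 1]) / diag[k]
    return z


def two_grid_step(u, f, h):
    """One damped Jacobi step (omega = 0.5) followed by a coarse-grid correction."""
    n, m = len(u), (len(u) - 1) // 2
    Au = apply_A(u, h)
    u = [u[k] + 0.5 * (h ** 2 / 2.0) * (f[k] - Au[k]) for k in range(n)]
    Au = apply_A(u, h)
    defect = [Au[k] - f[k] for k in range(n)]
    coarse = [(defect[2 * k - 2] + 2.0 * defect[2 * k - 1] + defect[2 * k]) / 4.0
              for k in range(1, m + 1)]                   # restriction
    z = [0.0] + solve_A(coarse, 2.0 * h) + [0.0]          # coarse solve, mesh width 2h
    correction = [0.0] * n
    for k in range(1, m + 1):
        correction[2 * k - 1] = z[k]                      # prolongation, even nodes
    for k in range(1, m + 2):
        correction[2 * k - 2] = 0.5 * (z[k] + z[k - 1])   # prolongation, odd nodes
    return [u[k] - correction[k] for k in range(n)]


def error_norm(u, exact):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, exact)))


n = 15
h = 1.0 / (n + 1)
exact = [math.sin(3.0 * k * h) + (k * h) ** 2 for k in range(1, n + 1)]
f = apply_A(exact, h)
u = [0.0] * n
errors = [error_norm(u, exact)]
for step in range(8):
    u = two_grid_step(u, f, h)
    errors.append(error_norm(u, exact))
    print(f"step {step + 1}: error = {errors[-1]:.3e}, ratio = {errors[-1] / errors[-2]:.3f}")
```

In this model problem the error reduction per step should not exceed one-half, independently of the mesh width, in contrast to the Jacobi iteration alone.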



Theorem 4.18 shows that the two-grid method is a very fast iteration. As 
compared to the classical Jacobi and Gauss-Seidel methods and also to the 
SOR method with optimal relaxation parameter, it decreases the spectral 
radius from a value close to one to one-half, which causes a substantial 
increase in the speed of convergence. However, for practical computations
it has the disadvantage that in each step the solution of a system with half
the number of unknowns is required.

This drawback of the two-grid method is remedied by the multigrid
method. Whereas for the two-grid method as described above only two
grids are used, the multigrid method uses M > 2 different grids with mesh
widths $h_\mu = 2^{\mu-1} h$, $\mu = 1, \ldots, M$, obtained from the mesh
width h on the finest grid. The multigrid method is defined recursively. The
method for M + 1 grids performs one or several steps of the damped Jacobi
iteration on the finest grid with mesh width h and uses as approximate inverse
for the defect correction one or several steps of the multigrid iteration on
the M grids with mesh widths $2h, 4h, \ldots, 2^M h$. To be more explicit, the
three-grid method uses one or several steps of the two-grid method as the
defect correction of the damped Jacobi iteration on the finest grid; the
four-grid method uses one or several steps of the three-grid method as the
defect correction; and so on. To describe further details of the multigrid
method, in particular showing that the computational cost of one step of a
multigrid iteration is proportional to the cost of the Jacobi iterations on
the finest grid provided that the coarsest grid is coarse enough, is beyond
the aim of this introduction. For a comprehensive study we refer to
[8, 26, 29, 63].



Problems 



4.1 Consider the solution of the linear system

$$\begin{aligned}
 5x_1 \phantom{{}+8x_2} - 2x_3 &= -1 \\
-4x_1 + 8x_2 + 2x_3 &= 18 \\
 5x_2 + 9x_3 &= 37
\end{aligned}$$

by the Jacobi method. Give an estimate on the number of iterations needed to
ensure that $\|x_\nu - x\|_\infty < 10^{-3}$ if the iteration is started with
$x_0 = (0, 0, 0)^T$.

4.2 Write a computer program for the Jacobi method, the Gauss-Seidel method, 
and the SOR method and test it for various examples. 

4.3 Show that a matrix A has spectral radius $\rho(A) < 1$ if and only if it
satisfies $\lim_{\nu \to \infty} A^\nu = 0$.



4.4 Prove that the Jacobi method converges for strictly column-diagonally dom- 
inant matrices (compare (4.5)). 



4.5 Show that an n x n matrix A is reducible if and only if there exists an
n x n permutation matrix P such that

$$P^{-1} A P = \begin{pmatrix} A_{11} & 0 \\ A_{21} & A_{22} \end{pmatrix},$$

where $A_{11}$ is a k x k matrix and $A_{22}$ is an (n-k) x (n-k) matrix with
$1 \le k \le n-1$.



4.6 Show that the matrix A from Example 4.5 is irreducible and weakly row- 
diagonally dominant. 

4.7 Let

$$A = \begin{pmatrix} 1 & a & a \\ a & 1 & a \\ a & a & 1 \end{pmatrix}.$$

Show that for $1 \le 2a < 2$ the Gauss-Seidel method is convergent and the
Jacobi method is not.

4.8 For the matrix

$$A = \begin{pmatrix} 1 & -2 & 2 \\ -1 & 1 & -1 \\ -2 & -2 & 1 \end{pmatrix}$$

show that the Jacobi method is convergent and the Gauss-Seidel method is not.

4.9 For the matrix

$$A = \begin{pmatrix} 2 & 1 & -1 \\ -2 & 2 & -2 \\ 1 & 1 & 2 \end{pmatrix}$$

show that the Gauss-Seidel method is convergent and the Jacobi method is not.

4.10 Show that the matrix

$$A = \begin{pmatrix}
 2 &  0 & -1 & -1 \\
 0 &  2 & -1 & -1 \\
-1 & -1 &  2 &  0 \\
-1 & -1 &  0 &  2
\end{pmatrix}$$

is irreducible and that the Jacobi method is not convergent.

4.11 Show that the iteration matrix of the Gauss-Seidel method has eigenvalue 
zero. 

4.12 Consider the variant of the Gauss-Seidel iteration where the components 
are iterated from the nth component backward to the first component. What is 
the iteration matrix of this method? Obtain a symmetric method by alternating 
one step of the forward Gauss-Seidel method and one step of the backward 
Gauss-Seidel method. What is the iteration matrix of this method? 

4.13 Show that the Jacobi iteration converges for a matrix A if and only if it
converges for the transposed matrix $A^T$.

4.14 Show that the matrix A of Example 2.2 is irreducible, positive definite, 
and weakly row-diagonally dominant. 

4.15 Compute the eigenvalues of the Jacobi iteration matrix for the matrix A 
of Example 2.2. 

4.16 Let $A = (a_{jk})$ be a nonnegative n x n matrix, i.e., $a_{jk} \ge 0$,
$j, k = 1, \ldots, n$, and let $\rho(A) < 1$. Show that $I - A$ is nonsingular
and $(I - A)^{-1}$ is nonnegative.

4.17 Give a counterexample to show that the Jacobi method, in general, does 
not converge for positive definite matrices (see Theorem 4.12). 

4.18 Show by direct computations that the eigenvectors given by (4.19) are
orthogonal.

4.19 Prove the relations (4.21), (4.22), (4.24), and (4.25). 

4.20 Show that

$$\rho(T_N) \le \max_{0 \le t \le 1}\, \bigl[\,t(1-t)^N + (1-t)\,t^N\,\bigr]$$

for the two-grid iteration matrix with N damped Jacobi iterations at each step.




5 

Ill-Conditioned Linear Systems 



For problems in mathematical physics Hadamard [31] postulated three re- 
quirements: A solution should exist, the solution should be unique, and the 
solution should depend continuously on the data. The third postulate is 
motivated by the fact that in general, in applications the data will be mea- 
sured quantities and therefore always contaminated by errors. A problem 
satisfying all three requirements is called well-posed . Otherwise, it is called 
ill-posed. If $A : X \to Y$ is a bounded linear operator mapping a normed
space X into a normed space Y, then the equation $Ax = y$ is well-posed
if A is bijective and the inverse operator $A^{-1} : Y \to X$ is bounded (see
Theorem 3.24). Since the inverse of a linear operator again is linear, in 
the case of finite-dimensional spaces X and Y, by Theorem 3.26 bijectivity 
of A implies boundedness of the inverse operator. Hence, in the sense of 
Hadamard, nonsingular linear systems are well-posed. 

However, since one wants to make sure that small errors in the data 
of a linear system will cause only small errors in the solution, there is an 
additional need for a measure of the degree of well-posedness, or stability. 
Such a measure is provided through the notion of the condition number, 
which we will introduce in this chapter. This will enable us to distinguish
between well-conditioned and ill-conditioned linear systems. For the latter,
small errors in the data may cause large errors in the solution, and therefore 
their numerical solution requires special care. 

Hence, we will continue the chapter with a brief discussion of the singular 
value cutoff and the Tikhonov regularization as efficient means to deal with 
ill-conditioned linear systems. Our analysis will be based on the singular 
value decomposition and will include the introduction of the pseudo-inverse, 
or Moore-Penrose inverse. For an extension of these ideas to ill-posed linear
operator equations in infinite-dimensional spaces we refer to [14, 22, 28, 37, 
39, 43]. 



5.1 Condition Number 



We begin with an example of an ill-conditioned linear system arising through 
a simple least squares problem. 

Example 5.1 We consider the best approximation of a given continuous
function $f : [0, 1] \to \mathbb{R}$ by a polynomial

$$p(x) = \sum_{k=0}^{n} \alpha_k x^k$$

of degree n in the least squares sense, i.e., with respect to the $L^2$ norm.
Using the monomials $x \mapsto x^k$, $k = 0, 1, \ldots, n$, as a basis of the
subspace $P_n \subset C[0, 1]$ of polynomials of degree less than or equal to n
(see Theorem 8.2), from Corollary 3.53 and the integrals

$$\int_0^1 x^j x^k \, dx = \frac{1}{j + k + 1}$$

it follows that the coefficients $\alpha_0, \ldots, \alpha_n$ of the best
approximation are uniquely determined by the normal equations

$$\sum_{k=0}^{n} \frac{1}{j + k + 1}\, \alpha_k = \int_0^1 f(x)\, x^j \, dx, \quad j = 0, \ldots, n. \qquad (5.1)$$

In the special case

$$f(x) = \frac{1}{1 + x}$$

we have the right-hand sides

$$r_j := \int_0^1 \frac{x^j}{1 + x}\, dx, \quad j = 0, \ldots, n.$$

In particular, $r_0 = \ln 2$, and from the geometric sum

$$\sum_{k=0}^{j-1} (-1)^k x^k = \frac{1 - (-1)^j x^j}{1 + x}$$

we deduce that

$$r_j = (-1)^j \Bigl\{ \ln 2 + \sum_{k=1}^{j} \frac{(-1)^k}{k} \Bigr\}, \quad j = 1, \ldots, n.$$

Therefore, the solution of (5.1) is of the form

$$\alpha_j = \beta_j \ln 2 + \gamma_j, \quad j = 0, \ldots, n,$$

with rational numbers $\beta_j$ and $\gamma_j$. Table 5.1 gives the exact
solution of the linear system (5.1) obtained by Gaussian elimination carried
out in terms of rational numbers to compute the coefficients $\beta_j$ and
$\gamma_j$ and then inserting ln 2 with ten-decimal-digit accuracy. The results
indicate convergence of the coefficients to the coefficients
$\alpha_k = (-1)^k$ of the Taylor series for f.

TABLE 5.1. Exact solution of the linear system (5.1)

 n     α_0       α_1       α_2       α_3       α_4      α_5     α_6
 1    0.9314   -0.4766
 2    0.9860   -0.8040    0.3274
 3    0.9972   -0.9389    0.6645   -0.2247
 4    0.9994   -0.9830    0.8630   -0.5334    0.1543
 5      …      -0.9956    0.9512      …       0.4191     …
 6      …      -0.9989      …         …       0.6672     …       …



However, if we take as right-hand sides the values obtained for $r_j$ by using
ln 2 with five-decimal-digit accuracy, then Gaussian elimination yields the
results of Table 5.2.

TABLE 5.2. Numerical solution of the linear system (5.1)

 n    α_0     α_1       α_2       α_3        α_4        α_5       α_6
 1   0.93    -0.47
 2   0.98    -0.80      0.32
 3   0.99    -0.95      0.70     -0.24
 4   1.00    -1.16      1.63     -1.69       0.72
 5   1.06    -2.74     12.68    -31.16      33.87     -13.25
 6   1.39   -16.58    151.09   -584.79    1071.93    -926.75    304.49




Despite the fact that the changes in the right-hand sides are less than 
0.000005, we obtain drastic changes in the solution. Therefore, qualitatively 
we may say that our linear system provides an example of an ill-conditioned 
system. The matrix of this example is known as the Hilbert matrix. □ 
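
The sensitivity seen in Example 5.1 can be explored with exact rational arithmetic. The sketch below is illustrative only: the function names are hypothetical, and ln 2 is replaced by the two rounded decimal values so that all remaining arithmetic is exact. It is not claimed to reproduce the printed digits of Tables 5.1 and 5.2, which depend on how the original elimination was carried out.

```python
from fractions import Fraction


def solve_exact(A, b):
    """Gaussian elimination in exact rational arithmetic (A assumed nonsingular)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [Fraction(0)] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def least_squares_coefficients(n, ln2):
    """Solve (5.1) for f(x) = 1/(1+x) with ln 2 replaced by the rational value ln2."""
    H = [[Fraction(1, j + k + 1) for k in range(n + 1)] for j in range(n + 1)]
    rhs = []
    partial = Fraction(0)                  # sum_{k=1}^{j} (-1)^k / k
    for j in range(n + 1):
        if j > 0:
            partial += Fraction((-1) ** j, j)
        rhs.append((-1) ** j * (ln2 + partial))
    return [float(a) for a in solve_exact(H, rhs)]


accurate = Fraction("0.6931471806")        # ln 2 with ten decimal digits
rounded = Fraction("0.69315")              # ln 2 with five decimal digits
for n in (1, 3, 6):
    print(n, [f"{a:.4f}" for a in least_squares_coefficients(n, accurate)])
    print(n, [f"{a:.2f}" for a in least_squares_coefficients(n, rounded)])
```

Because only the value inserted for ln 2 differs between the two runs, any difference in the output is caused by the perturbation of the right-hand sides alone and not by rounding during elimination.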

For a quantitative analysis of the phenomenon illustrated by Example 
5.1 we introduce the concept of the condition number. 

Definition 5.2 Let X and Y be normed spaces and let $A : X \to Y$ be a
bounded linear operator with a bounded inverse $A^{-1} : Y \to X$. Then

$$\operatorname{cond}(A) := \|A\|\, \|A^{-1}\|$$

is called the condition number of A.

Clearly, cond(A) depends on the chosen norm. Because of (see Remark
3.25)

$$1 = \|I\| = \|A A^{-1}\| \le \|A\|\, \|A^{-1}\|$$

we always have $\operatorname{cond}(A) \ge 1$. Definition 5.2, in particular,
includes the condition number of a nonsingular n x n matrix A. Here, in the
case where both the domain and range are given the $\ell_p$ norm for
$p = 1, 2, \infty$ we will write $\operatorname{cond}_p(A)$.

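Theorem 5.3 Let X and Y be Banach spaces, let $A : X \to Y$ be a bounded
linear operator with a bounded inverse $A^{-1} : Y \to X$, and let
$A^\delta : X \to Y$ be a bounded linear operator such that
$\|A^{-1}\|\, \|A^\delta - A\| < 1$. Assume that $x$ and $x^\delta$ are solutions
of the equations

$$Ax = y \qquad (5.2)$$

and

$$A^\delta x^\delta = y^\delta, \qquad (5.3)$$

respectively. Then

$$\frac{\|x^\delta - x\|}{\|x\|} \le
\frac{\operatorname{cond}(A)}{1 - \operatorname{cond}(A)\, \dfrac{\|A^\delta - A\|}{\|A\|}}
\left\{ \frac{\|y^\delta - y\|}{\|y\|} + \frac{\|A^\delta - A\|}{\|A\|} \right\}.$$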


Proof. Writing $A^\delta = A\,[I + A^{-1}(A^\delta - A)]$, by Theorem 3.48 we
observe that the inverse operator
$[A^\delta]^{-1} = [I + A^{-1}(A^\delta - A)]^{-1} A^{-1}$ exists and is
bounded by

$$\|[A^\delta]^{-1}\| \le \frac{\|A^{-1}\|}{1 - \|A^{-1}\|\, \|A^\delta - A\|}. \qquad (5.4)$$

From (5.2) and (5.3) we find that

$$A^\delta (x^\delta - x) = y^\delta - y - (A^\delta - A)\,x,$$

whence

$$x^\delta - x = [A^\delta]^{-1} \{\, y^\delta - y - (A^\delta - A)\,x \,\}$$

follows. Now we can estimate

$$\|x^\delta - x\| \le \|[A^\delta]^{-1}\|\, \{\, \|y^\delta - y\| + \|A^\delta - A\|\, \|x\| \,\}$$

and insert (5.4) to obtain

$$\frac{\|x^\delta - x\|}{\|x\|} \le
\frac{\operatorname{cond}(A)}{1 - \|A^{-1}\|\, \|A^\delta - A\|}
\left\{ \frac{\|y^\delta - y\|}{\|A\|\, \|x\|} + \frac{\|A^\delta - A\|}{\|A\|} \right\}.$$

From this the assertion follows with the aid of $\|A\|\, \|x\| \ge \|y\|$. □

Theorem 5.3 shows that the condition number may serve as a measure of 
stability for linear operator equations and, in particular, for linear systems. 
A linear system with a small condition number is stable, whereas a large 
condition number indicates instability. We call a linear system with a small
condition number well-conditioned. Otherwise, it is called ill-conditioned.

By Theorem 3.31, the condition number of a Hermitian matrix A in the
Euclidean norm is given by

$$\operatorname{cond}_2(A) = \frac{|\lambda_{\max}|}{|\lambda_{\min}|},$$

where $\lambda_{\max}$ and $\lambda_{\min}$ denote the eigenvalues of A with
largest and smallest modulus, respectively. Table 5.3 is obtained by employing
the QR algorithm (see Section 7.4) for the computation of matrix eigenvalues.
It illustrates quantitatively the degree of instability, i.e., the
ill-conditionedness of the linear system from Example 5.1.

TABLE 5.3. Condition number for the linear system (5.1)

 n            2             3             4             5             6
 λ_max       1.27          1.41          1.50          1.57          1.62
 λ_min    6.57·10^-2    2.69·10^-3    9.67·10^-5    3.29·10^-6    1.08·10^-7
 cond_2      19.3       5.24·10^2     1.55·10^4     4.77·10^5     1.50·10^7
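
The extreme eigenvalues of a small Hilbert matrix can also be approximated without the QR algorithm. As a stand-in, the sketch below uses power iteration for the largest eigenvalue and inverse iteration for the smallest one; the function names, the starting vectors, and the fixed number of steps are assumptions made for this illustration, and the indexing of the table by matrix size is likewise assumed.

```python
import math


def hilbert(n):
    """The n x n Hilbert matrix with entries 1 / (j + k + 1), j, k = 0..n-1."""
    return [[1.0 / (j + k + 1) for k in range(n)] for j in range(n)]


def solve(A, b):
    """Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def multiply(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def normalize(x):
    length = math.sqrt(sum(v * v for v in x))
    return [v / length for v in x]


def extreme_eigenvalues(A, steps=200):
    """Approximate the largest and smallest eigenvalue of a positive definite matrix."""
    n = len(A)
    x = normalize([1.0] * n)
    for _ in range(steps):                  # power iteration
        x = normalize(multiply(A, x))
    lam_max = sum(a * b for a, b in zip(x, multiply(A, x)))   # Rayleigh quotient
    y = normalize([(-1.0) ** i for i in range(n)])
    for _ in range(steps):                  # inverse iteration
        y = normalize(solve(A, y))
    lam_min = sum(a * b for a, b in zip(y, multiply(A, y)))
    return lam_max, lam_min


for n in range(2, 7):
    lam_max, lam_min = extreme_eigenvalues(hilbert(n))
    print(f"n = {n}: lambda_max = {lam_max:.2f}, lambda_min = {lam_min:.2e}, "
          f"cond_2 = {lam_max / lam_min:.3g}")
```

Inverse iteration needs one linear solve per step, so this approach is only sensible for small matrices such as the ones considered here.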



5.2 Singular Value Decomposition 

In the sequel we wish to introduce some of the basic concepts for the 
approximate solution of ill-conditioned linear systems. Our approach will 
be based on the singular value decomposition of a matrix A, which need 
not be a square matrix. 

For each m x n matrix A, representing an operator
$A : \mathbb{C}^n \to \mathbb{C}^m$, the n x n matrix $A^*A$ is Hermitian and
positive semidefinite (see Problem 5.9). Therefore, the eigenvalues of $A^*A$
are real and nonnegative (see Theorem 3.29). The nonnegative square roots of
these eigenvalues are called the singular values of A.

For the remainder of this chapter, by $(\cdot\,, \cdot)$ we denote the
Euclidean scalar product in $\mathbb{C}^n$. For an m x n matrix A of rank r,
the nullspace $N(A) = \{x \in \mathbb{C}^n : Ax = 0\}$ has dimension
$\dim N(A) = n - r$. We note that $A^*Au = 0$ implies that

$$\|Au\|_2^2 = (Au, Au) = (u, A^*Au) = 0;$$

i.e., the nullspaces of A and $A^*A$ coincide. Hence $\dim N(A^*A) = n - r$,
and therefore A has exactly r positive singular values $\mu$ (counted according
to their geometric multiplicity, i.e., according to the dimension of the
nullspace of $\mu^2 I - A^*A$).

Theorem 5.4 Let A be an m x n matrix of rank r. Then there exist
nonnegative numbers

$$\mu_1 \ge \mu_2 \ge \cdots \ge \mu_r > \mu_{r+1} = \cdots = \mu_n = 0$$

and orthonormal vectors $u_1, \ldots, u_n \in \mathbb{C}^n$ and
$v_1, \ldots, v_m \in \mathbb{C}^m$ such that

$$\begin{aligned}
A u_j &= \mu_j v_j, \quad A^* v_j = \mu_j u_j, && j = 1, \ldots, r, \\
A u_j &= 0, && j = r+1, \ldots, n, \\
A^* v_j &= 0, && j = r+1, \ldots, m.
\end{aligned} \qquad (5.5)$$

For each $x \in \mathbb{C}^n$ we have the singular value decomposition

$$Ax = \sum_{j=1}^{r} \mu_j\, (x, u_j)\, v_j. \qquad (5.6)$$

Each system $(\mu_j, u_j, v_j)$ with these properties is called a singular
system of the matrix A.

Proof. The Hermitian and positive semidefinite matrix $A^*A$ of rank r has
n orthonormal eigenvectors $u_1, \ldots, u_n$ with nonnegative eigenvalues

$$A^*A u_j = \mu_j^2 u_j, \quad j = 1, \ldots, n, \qquad (5.7)$$

which we may assume to be ordered according to
$\mu_1 \ge \mu_2 \ge \cdots \ge \mu_r > 0$ and
$\mu_{r+1} = \cdots = \mu_n = 0$. We define

$$v_j := \frac{1}{\mu_j}\, A u_j, \quad j = 1, \ldots, r.$$

Then, using (5.7) we have

$$(v_j, v_k) = \frac{1}{\mu_j \mu_k}\, (A u_j, A u_k) = \frac{1}{\mu_j \mu_k}\, (u_j, A^*A u_k) = \delta_{jk}, \quad j, k = 1, \ldots, r,$$

where $\delta_{jk} = 1$ for $k = j$, and $\delta_{jk} = 0$ for $k \ne j$.
Further, we compute that $A^* v_j = \mu_j u_j$, $j = 1, \ldots, r$, and hence
the first line of (5.5) is proven. The second line of (5.5) is a consequence
of $N(A) = N(A^*A)$.

If r < m, by the Gram-Schmidt orthogonalization procedure from Theorem 3.18
we can extend $v_1, \ldots, v_r$ to an orthonormal basis $v_1, \ldots, v_m$ of
$\mathbb{C}^m$. Since $A^*$ has rank r, we have $\dim N(A^*) = m - r$. From
this we can conclude the third line of (5.5).

Since the $u_1, \ldots, u_n$ form an orthonormal basis of $\mathbb{C}^n$, we
can represent

$$x = \sum_{j=1}^{n} (x, u_j)\, u_j,$$

and (5.6) follows by applying A and observing (5.5). □



Clearly, we can rewrite the equations (5.5) in the form

$$A = V D U^*, \qquad (5.8)$$

where $U = (u_1, \ldots, u_n)$ and $V = (v_1, \ldots, v_m)$ are unitary n x n
and m x m matrices, respectively, and where D is an m x n diagonal matrix with
entries $d_{jj} = \mu_j$ for $j = 1, \ldots, r$ and $d_{jk} = 0$ otherwise.
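
The construction in the proof of Theorem 5.4 can be traced for a small real matrix. The sketch below assumes a real m x 2 matrix of rank 2, so that $A^*A$ is a symmetric 2 x 2 matrix whose eigenpairs are available in closed form; the function name and the example matrix are illustrative assumptions.

```python
import math


def singular_system_two_columns(A):
    """Singular system (mu_j, u_j, v_j), j = 1, 2, of a real m x 2 matrix of rank 2."""
    # entries of the symmetric 2 x 2 matrix A^T A = [[a, b], [b, d]]
    a = sum(row[0] * row[0] for row in A)
    b = sum(row[0] * row[1] for row in A)
    d = sum(row[1] * row[1] for row in A)
    mean, radius = (a + d) / 2.0, math.hypot((a - d) / 2.0, b)
    eigenvalues = [mean + radius, mean - radius]          # mu_1^2 >= mu_2^2
    if b == 0.0:
        eigenvectors = ([(1.0, 0.0), (0.0, 1.0)] if a >= d
                        else [(0.0, 1.0), (1.0, 0.0)])
    else:
        eigenvectors = [(b, lam - a) for lam in eigenvalues]
    system = []
    for lam, (p, q) in zip(eigenvalues, eigenvectors):
        length = math.hypot(p, q)
        u = (p / length, q / length)
        mu = math.sqrt(lam)
        # v_j := A u_j / mu_j as in the proof of Theorem 5.4
        v = tuple((row[0] * u[0] + row[1] * u[1]) / mu for row in A)
        system.append((mu, u, v))
    return system


A = [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
system = singular_system_two_columns(A)
for mu, u, v in system:
    print(f"mu = {mu:.4f}, u = {[round(c, 4) for c in u]}, v = {[round(c, 4) for c in v]}")
```

For this example $A^*A$ has the eigenvalues 3 and 1, so the singular values are $\sqrt{3}$ and 1. Only the first r vectors $v_j$ are produced; extending them to a basis of the whole space would require the Gram-Schmidt step mentioned in the proof.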

Theorem 5.5 Let A be an m x n matrix of rank r with singular system
$(\mu_j, u_j, v_j)$. The linear system

$$Ax = y \qquad (5.9)$$

is solvable if and only if

$$(y, z) = 0 \qquad (5.10)$$

for all $z \in \mathbb{C}^m$ with $A^*z = 0$. In this case a solution of (5.9)
is given by

$$x_0 = \sum_{j=1}^{r} \frac{1}{\mu_j}\, (y, v_j)\, u_j. \qquad (5.11)$$

Proof. Let x be a solution of (5.9) and let $A^*z = 0$. Then

$$(y, z) = (Ax, z) = (x, A^*z) = 0.$$

This implies the necessity of condition (5.10) for the solvability of (5.9).

Conversely, assume that (5.10) is satisfied. In terms of the orthonormal
basis $v_1, \ldots, v_m$ of $\mathbb{C}^m$ condition (5.10) implies that

$$y = \sum_{j=1}^{r} (y, v_j)\, v_j, \qquad (5.12)$$

since $A^* v_j = 0$ for $j = r+1, \ldots, m$. For the vector $x_0$ defined by
(5.11) we have that

$$A x_0 = \sum_{j=1}^{r} (y, v_j)\, v_j.$$

In view of (5.12) this implies that $A x_0 = y$, and the proof is complete. □

Since $N(A) = \operatorname{span}\{u_{r+1}, \ldots, u_n\}$, the vector $x_0$
defined by (5.11) has the property

$$(x_0, x) = 0$$

for all $x \in N(A)$. In the case where equation (5.9) has more than one
solution, the general solution is obtained from (5.11) by adding an arbitrary
solution x of the homogeneous equation $Ax = 0$. Then from

$$\|x_0 + x\|_2^2 = \|x_0\|_2^2 + 2\,\mathrm{Re}\,(x_0, x) + \|x\|_2^2 = \|x_0\|_2^2 + \|x\|_2^2$$

we observe that (5.11) represents the uniquely determined solution of (5.9)
with minimal Euclidean norm.

In the case where equation (5.9) has no solution, we represent

$$y = \sum_{j=1}^{m} (y, v_j)\, v_j$$

in terms of the orthonormal basis $v_1, \ldots, v_m$. Let $x_0$ be given by
(5.11) and let $x \in \mathbb{C}^n$ be arbitrary. Then

$$(Ax - Ax_0,\, Ax_0 - y) = 0,$$

since $Ax - Ax_0 \in \operatorname{span}\{v_1, \ldots, v_r\}$ and
$Ax_0 - y \in \operatorname{span}\{v_{r+1}, \ldots, v_m\}$. This implies

$$\|Ax - y\|_2^2 = \|Ax - Ax_0\|_2^2 + \|Ax_0 - y\|_2^2,$$

whence (5.11) represents a least squares solution of (5.9) (see Example 2.4).
Again, it can be shown that (5.11) is the uniquely determined least squares
solution of (5.9) with minimal Euclidean norm (see Problem 5.11).

Hence, (5.11) defines a linear operator
$A^\dagger : \mathbb{C}^m \to \mathbb{C}^n$ by

$$A^\dagger y := \sum_{j=1}^{r} \frac{1}{\mu_j}\, (y, v_j)\, u_j, \quad y \in \mathbb{C}^m, \qquad (5.13)$$

which of course also allows a representation by an n x m matrix. Due to
the properties of $A^\dagger y$ as discussed above, this operator or matrix is
known as the pseudo-inverse or Moore-Penrose inverse of A (see [7]). It was
first introduced by Moore in 1920 and independently rediscovered by Penrose
in 1955. For an alternative introduction of $A^\dagger$ see Problem 5.12.

By Theorem 3.31 the condition number of a nonsingular matrix with
respect to the Euclidean norm is given by the quotient of the largest and
smallest singular value. Theorem 5.5 demonstrates the influence of small
singular values on the condition of the matrix A. If for some
$\delta \in \mathbb{C}$ we perturb the right-hand side by setting
$y^\delta = y + \delta v_j$, we obtain a perturbed solution
$x^\delta = x + \delta u_j / \mu_j$. Hence, the ratio
$\|x^\delta - x\|_2 / \|y^\delta - y\|_2 = 1/\mu_j$ becomes large if A
possesses small singular values.

This observation suggests stabilizing an ill-conditioned linear system by
damping or filtering out the influence of the factor $1/\mu_j$ in the solution
formula (5.11). In the so-called spectral cutoff, the terms in (5.11)
corresponding to small singular values are simply neglected. Of course, this
requires some strategy on how to determine the number of terms being
summed up in (5.11). A very effective strategy is provided by the following
discrepancy principle. If the right-hand side y of a linear system is known
only within an error level $\delta$, then it is quite natural to require
$Ax = y$ to be satisfied only up to the same accuracy $\delta$, since it does
not make much sense to try to satisfy the linear system more accurately than
the right-hand side is known. To describe the discrepancy principle more
precisely, given an erroneous right-hand side $y^\delta$ with known error level
$\|y^\delta - y\|_2 \le \delta$, in the spectral cutoff the solution
$x = A^\dagger y$ of $Ax = y$ is approximated by

$$x_p := \sum_{j=1}^{p} \frac{1}{\mu_j}\, (y^\delta, v_j)\, u_j \qquad (5.14)$$

for some $0 \le p \le r$. For the following theorem we have to assume that
$Ax = y$ is solvable.

Theorem 5.6 Let A be an m x n matrix with singular system $(\mu_j, u_j, v_j)$
and let $y \in A(\mathbb{C}^n)$, $y^\delta \in \mathbb{C}^m$ satisfy

$$\|y^\delta - y\|_2 \le \delta < \|y^\delta\|_2$$

for $\delta > 0$. Then there exists a smallest integer $p = p(\delta)$ such that

$$\|A x_p - y^\delta\|_2 \le \delta. \qquad (5.15)$$

This discrepancy principle for the spectral cutoff is regular in the sense that
if the error level $\delta$ tends to zero, then

$$x_p \to A^\dagger y, \quad \delta \to 0. \qquad (5.16)$$

Proof. Consider the function $F : \{0, 1, \ldots, r\} \to \mathbb{R}$ defined by

$$F(p) := \|A x_p - y^\delta\|_2^2 - \delta^2.$$

In terms of the singular system, we can write

$$F(p) = \sum_{j=p+1}^{m} |(y^\delta, v_j)|^2 - \delta^2. \qquad (5.17)$$

Hence, F is monotonically nonincreasing with
$F(0) = \|y^\delta\|_2^2 - \delta^2 > 0$ and $F(r) = -\delta^2 < 0$ if the rank
r of A is equal to m. If r < m, then using $(y, v_j) = 0$,
$j = r+1, \ldots, m$ (see the proof of Theorem 5.5), we have

$$F(r) = \sum_{j=r+1}^{m} |(y^\delta - y, v_j)|^2 - \delta^2 \le \|y^\delta - y\|_2^2 - \delta^2 \le 0.$$

Therefore, there exists a smallest integer $p = p(\delta)$ such that
$F(p) \le 0$. Note that $p \le r$. In actual computations, this stopping
parameter p is determined by terminating the sum (5.14) when the right-hand
side of (5.17) becomes smaller than or equal to zero for the first time.

In order to show the convergence (5.16), we note that
$\|A x_p - y^\delta\|_2 \le \delta$ implies

$$\|A x_p - y\|_2 \le \|A x_p - y^\delta\|_2 + \|y^\delta - y\|_2 \le 2\delta \to 0, \quad \delta \to 0,$$

i.e., $A x_p \to y$, $\delta \to 0$. From this, since $A^\dagger A v = v$ for
all $v \in \operatorname{span}\{u_1, \ldots, u_r\}$, we finally can conclude
that $x_p \to A^\dagger y$, $\delta \to 0$. □
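
The stopping rule described in the proof can be sketched for a diagonal model problem, where the singular system is known trivially ($u_j = v_j = e_j$). The matrix, the noise vector, and all names below are assumptions chosen for illustration.

```python
import math


def spectral_cutoff(mu, y_delta, delta):
    """Spectral cutoff with the discrepancy principle for A = diag(mu).

    The singular values are assumed to be ordered, mu[0] >= mu[1] >= ... > 0.
    Returns the stopping index p and the approximation x_p from (5.14).
    """
    r = len(mu)
    # tail = sum_{j > p} |(y_delta, v_j)|^2, so F(p) = tail - delta^2 as in (5.17)
    tail = sum(c * c for c in y_delta)
    p = 0
    while tail - delta ** 2 > 0.0 and p < r:
        tail -= y_delta[p] ** 2
        p += 1
    x_p = [y_delta[j] / mu[j] if j < p else 0.0 for j in range(r)]
    return p, x_p


mu = [10.0 ** (-j) for j in range(6)]              # singular values 1, 0.1, ..., 1e-5
x_true = [1.0] * 6
delta = 1e-3
noise = [delta / math.sqrt(6.0) * (-1.0) ** j for j in range(6)]   # Euclidean norm delta
y_delta = [m * x + e for m, x, e in zip(mu, x_true, noise)]

p, x_p = spectral_cutoff(mu, y_delta, delta)
x_naive = [c / m for c, m in zip(y_delta, mu)]     # all terms of (5.11) retained


def distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


print("stopping index p =", p)
print("error with cutoff   :", distance(x_p, x_true))
print("error without cutoff:", distance(x_naive, x_true))
```

The cutoff discards the components whose data are dominated by noise, which in this example leaves a considerably smaller error than retaining all terms.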

The spectral cutoff method requires the full solution of the eigenvalue 
problem for the matrix A* A, which we will describe in Chapter 7. As an 
alternative, in the following section we shall describe the Tikhonov regu- 
larization, which can be performed without explicitly knowing the singular 
value decomposition. 



5.3 Tikhonov Regularization 

Tikhonov regularization as introduced independently by Phillips in 1962
and Tikhonov in 1963 is obtained from (5.11) by multiplying $1/\mu_j$ by the
damping factor

$$\frac{\mu_j^2}{\alpha + \mu_j^2},$$

where $\alpha$ is some positive regularization parameter.

Theorem 5.7 Let A be an m x n matrix of rank r with singular system
$(\mu_j, u_j, v_j)$ and let $\alpha > 0$. Then for each $y \in \mathbb{C}^m$
the linear system

$$\alpha x_\alpha + A^*A x_\alpha = A^* y \qquad (5.18)$$

is uniquely solvable, and the solution is given by

$$x_\alpha = \sum_{j=1}^{r} \frac{\mu_j}{\alpha + \mu_j^2}\, (y, v_j)\, u_j. \qquad (5.19)$$

Proof. For $\alpha > 0$ the matrix $\alpha I + A^*A$ is positive definite and
therefore nonsingular. Since

$$\alpha u_j + A^*A u_j = (\alpha + \mu_j^2)\, u_j,$$

a singular system for the matrix $\alpha I + A^*A$ is given by
$(\alpha + \mu_j^2, u_j, u_j)$, $j = 1, \ldots, n$. Now the assertion follows
from Theorem 5.5 with the aid of $(A^* y, u_j) = (y, A u_j)$ and using (5.5). □
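
The two descriptions of the Tikhonov solution, the linear system (5.18) and the expansion (5.19), can be compared for a small real example. In the sketch below the 3 x 2 matrix, its hard-coded singular system, and the function names are assumptions made for illustration; the 2 x 2 system is solved by Cramer's rule.

```python
import math

A = [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
# Singular system of A (rank 2), obtained from the eigenpairs of A^T A = [[2, 1], [1, 2]]
S2, S6 = math.sqrt(2.0), math.sqrt(6.0)
mu = [math.sqrt(3.0), 1.0]
u = [[1.0 / S2, 1.0 / S2], [1.0 / S2, -1.0 / S2]]
v = [[2.0 / S6, 1.0 / S6, 1.0 / S6], [0.0, -1.0 / S2, 1.0 / S2]]


def tikhonov_normal_equations(A, y, alpha):
    """Solve (alpha I + A^T A) x = A^T y for a real m x 2 matrix by Cramer's rule."""
    a = alpha + sum(row[0] * row[0] for row in A)
    b = sum(row[0] * row[1] for row in A)
    d = alpha + sum(row[1] * row[1] for row in A)
    g0 = sum(row[0] * yi for row, yi in zip(A, y))
    g1 = sum(row[1] * yi for row, yi in zip(A, y))
    det = a * d - b * b
    return [(g0 * d - b * g1) / det, (a * g1 - b * g0) / det]


def tikhonov_singular_system(mu, u, v, y, alpha):
    """Evaluate the sum (5.19) for a real singular system."""
    x = [0.0, 0.0]
    for mu_j, u_j, v_j in zip(mu, u, v):
        coefficient = mu_j / (alpha + mu_j ** 2) * sum(yi * vi for yi, vi in zip(y, v_j))
        x = [x[i] + coefficient * u_j[i] for i in range(2)]
    return x


y = [1.0, 2.0, 3.0]
for alpha in (1.0, 0.1, 1e-6):
    x_system = tikhonov_normal_equations(A, y, alpha)
    x_series = tikhonov_singular_system(mu, u, v, y, alpha)
    print(alpha, x_system, x_series)
```

Solving (5.18) directly requires no knowledge of the singular system, which is the practical advantage of Tikhonov regularization over the spectral cutoff.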

Corollary 5.8 Under the assumptions of Theorem 5.7 we have convergence:

$$\lim_{\alpha \to 0}\, (\alpha I + A^*A)^{-1} A^* y = A^\dagger y.$$

Proof. This is obvious from (5.13) and (5.19). □

Before we proceed with a discussion on how to choose the regularization 
parameter a, we give an interpretation of Tikhonov regularization as a 
penalized least squares method. 

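Theorem 5.9 Let A be an m x n matrix and let $\alpha > 0$. Then for each
$y \in \mathbb{C}^m$ there exists a unique $x_\alpha \in \mathbb{C}^n$ such that

$$\|A x_\alpha - y\|_2^2 + \alpha \|x_\alpha\|_2^2 = \inf_{x \in \mathbb{C}^n} \bigl\{ \|Ax - y\|_2^2 + \alpha \|x\|_2^2 \bigr\}. \qquad (5.20)$$

The minimizing vector $x_\alpha$ is given by the unique solution of the linear
system (5.18).

Proof. (Compare to the proof of Theorem 3.51.) We first note the relation

$$\begin{aligned}
\|Ax - y\|_2^2 + \alpha \|x\|_2^2 ={}& \|A x_\alpha - y\|_2^2 + \alpha \|x_\alpha\|_2^2 \\
&+ 2\,\mathrm{Re}\,(x - x_\alpha,\, \alpha x_\alpha + A^*A x_\alpha - A^* y) \\
&+ \|Ax - A x_\alpha\|_2^2 + \alpha \|x - x_\alpha\|_2^2,
\end{aligned} \qquad (5.21)$$

which is valid for all $x, x_\alpha \in \mathbb{C}^n$. From this it is obvious
that the solution $x_\alpha$ of (5.18) satisfies (5.20).

Conversely, let $x_\alpha$ be a solution of (5.20) and assume that

$$\alpha x_\alpha + A^*A x_\alpha \ne A^* y.$$

Then, setting $z := \alpha x_\alpha + A^*A x_\alpha - A^* y$, for
$x := x_\alpha - \varepsilon z$ with $\varepsilon \in \mathbb{R}$ from (5.21)
we have

$$\|Ax - y\|_2^2 + \alpha \|x\|_2^2 = \|A x_\alpha - y\|_2^2 + \alpha \|x_\alpha\|_2^2 - 2\varepsilon a + \varepsilon^2 b,$$

where

$$a := \|z\|_2^2 \quad \text{and} \quad b := \|Az\|_2^2 + \alpha \|z\|_2^2$$

are both positive. By choosing $\varepsilon = a/b$ we obtain

$$\|Ax - y\|_2^2 + \alpha \|x\|_2^2 < \|A x_\alpha - y\|_2^2 + \alpha \|x_\alpha\|_2^2,$$

which contradicts (5.20). □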

The interpretation of Tikhonov regularization through the above Theorem 5.9
indicates that it keeps the residual $\|A x_\alpha - y\|_2$ small and
stabilizes by preventing $x_\alpha$ from becoming large through the penalty
term $\alpha \|x_\alpha\|_2^2$.

From the proof of Theorem 5.7 we know that the eigenvalues of the
Hermitian matrix $\alpha I + A^*A$ are given by $\alpha + \mu_j^2$,
$j = 1, \ldots, n$. Hence by Theorem 3.27 we have that

$$\operatorname{cond}_2(\alpha I + A^*A) = \frac{\alpha + \mu_1^2}{\alpha + \mu_n^2}. \qquad (5.22)$$

Therefore stability of the linear system (5.18) requires the regularization 
parameter a to be fairly large. On the other hand, in order to keep the 
system (5.18) reasonably close to the original system Ax = y , we expect 
that a needs to be small. This observation is made more precise through the 
following considerations on the error occurring in Tikhonov regularization. 




FIGURE 5.1. Total error for Tikhonov regularization

Given an erroneous right-hand side $y^\delta$ with error level
$\|y^\delta - y\|_2 \le \delta$, the Tikhonov regularization approximates the
solution $x = A^\dagger y$ of $Ax = y$ by the solution $x_\alpha$ of the
regularized linear system

$$\alpha x_\alpha + A^*A x_\alpha = A^* y^\delta. \qquad (5.23)$$

Then, for the total error, writing

$$x_\alpha - x = (\alpha I + A^*A)^{-1} A^* (y^\delta - y) + (\alpha I + A^*A)^{-1} A^* y - A^\dagger y,$$

by the triangle inequality we have the estimate

$$\|x_\alpha - x\|_2 \le \|(\alpha I + A^*A)^{-1} A^*\|_2\, \delta + \|(\alpha I + A^*A)^{-1} A^* y - A^\dagger y\|_2.$$

This decomposition shows that the total error consists of two parts:

$$E_{\mathrm{total}} \le E_{\mathrm{data}} + E_{\mathrm{approx}}.$$

The first term, with the aid of Theorem 3.31, can be estimated by

$$E_{\mathrm{data}} = \|(\alpha I + A^*A)^{-1} A^*\|_2\, \delta \ge \frac{\mu_r}{\alpha + \mu_r^2}\, \delta.$$
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It reflects the influence of the incorrect data and, for fixed 5 , becomes large 
as a -» 0, if the smallest positive singular value p r is close to zero (see also 
Problem 5.16). The second term, 

Approx = ||(a/ + A*A)-'A*y - tfy\\ 2 , 

describes the approximation error due to the replacement of Ax — y by the 
regularized equation (5.23), and by Corollary 5.8, it goes to zero as a — » 0. 
This error behavior is illustrated in Figure 5.1. 

On the one hand, in view of (5.22) the stability of the system requires a large regularization parameter $\alpha$ to keep $E_{\text{data}}$ small, i.e., to keep the influence of the data error $\|y^\delta - y\|_2$ small. On the other hand, keeping $E_{\text{approx}}$ small asks for a small parameter $\alpha$.

Obviously, the choice of the parameter $\alpha$ has to be made through a compromise between accuracy and stability. An efficient strategy to achieve this is again provided by the discrepancy principle. In the following theorem we need to assume that $Ax = y$ is solvable.

Theorem 5.10 Let $A$ be an $m \times n$ matrix and let $y \in A(\mathbb{C}^n)$, $y^\delta \in \mathbb{C}^m$ satisfy
$$\|y^\delta - y\|_2 \le \delta < \|y^\delta\|_2$$
for $\delta > 0$. Then there exists a unique $\alpha = \alpha(\delta) > 0$ such that the unique solution $x_\alpha$ of (5.23) satisfies
$$\|Ax_\alpha - y^\delta\|_2 = \delta. \tag{5.24}$$

This discrepancy principle for Tikhonov regularization is regular in the sense that if the error level $\delta$ tends to zero, then
$$x_\alpha \to A^\dagger y, \quad \delta \to 0. \tag{5.25}$$

Proof. We have to show that the function $F : (0, \infty) \to \mathbb{R}$ defined by
$$F(\alpha) := \|Ax_\alpha - y^\delta\|_2^2 - \delta^2$$

has a unique zero. In terms of a singular system, from the representation 
(5.19) we find that 



m 2 

Therefore, $F$ is continuous and strictly monotonically increasing with the limits $F(\alpha) \to -\delta^2 < 0$, $\alpha \to 0$, and $F(\alpha) \to \|y^\delta\|_2^2 - \delta^2 > 0$, $\alpha \to \infty$. Hence, $F$ has exactly one zero $\alpha = \alpha(\delta)$.

Note that the condition $\|y^\delta - y\|_2 \le \delta < \|y^\delta\|_2$ implies that $y \neq 0$. Using (5.23), (5.24), and the triangle inequality we can estimate
$$\|y^\delta\|_2 - \delta = \|y^\delta\|_2 - \|Ax_\alpha - y^\delta\|_2 \le \|Ax_\alpha\|_2$$
and
$$\alpha\|Ax_\alpha\|_2 = \|AA^*(y^\delta - Ax_\alpha)\|_2 \le \|AA^*\|_2\,\delta.$$

Combining these two inequalities and using $\|y^\delta\|_2 \ge \|y\|_2 - \delta$ yields
$$\alpha \le \frac{\|AA^*\|_2\,\delta}{\|y\|_2 - 2\delta}$$
for all sufficiently small $\delta$.

This implies that $\alpha \to 0$, $\delta \to 0$. Now the convergence (5.25) follows from the representations (5.13) for $A^\dagger y$ and (5.19) for $x_\alpha$ (with $y$ replaced by $y^\delta$) and the fact that $\|y^\delta - y\|_2 \to 0$, $\delta \to 0$. □

In practice, of course, one does not need to determine the regularization parameter satisfying (5.24) exactly. Usually the following strategy will be sufficient: Choose some moderately sized $\alpha$ and then keep decreasing $\alpha$ by a constant factor $\gamma$, say $\gamma = 0.5$, until $F(\alpha)$ becomes negative.
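The strategy just described can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the function names `tikhonov`, `discrepancy`, and `choose_alpha` are hypothetical, the starting value `alpha0 = 1.0` is an arbitrary choice, and the diagonal test matrix is a toy problem, not the system of Example 5.1.

```python
import numpy as np

def tikhonov(A, y_delta, alpha):
    """Solve the regularized system (alpha*I + A^* A) x = A^* y_delta, cf. (5.23)."""
    A_star = A.conj().T
    n = A.shape[1]
    return np.linalg.solve(alpha * np.eye(n) + A_star @ A, A_star @ y_delta)

def discrepancy(A, x, y_delta, delta):
    """F(alpha) = ||A x_alpha - y_delta||_2^2 - delta^2 for x = x_alpha."""
    return np.linalg.norm(A @ x - y_delta) ** 2 - delta ** 2

def choose_alpha(A, y_delta, delta, alpha0=1.0, gamma=0.5, max_steps=60):
    """Decrease alpha by the factor gamma until F(alpha) becomes negative."""
    alpha = alpha0
    for _ in range(max_steps):
        x = tikhonov(A, y_delta, alpha)
        if discrepancy(A, x, y_delta, delta) < 0:
            return alpha, x
        alpha *= gamma
    raise RuntimeError("F(alpha) did not become negative")

# Hypothetical toy data: exact solution (1, 1), data error of norm delta.
A = np.diag([1.0, 0.1])
y = A @ np.array([1.0, 1.0])
delta = 0.01
y_delta = y + np.array([delta, 0.0])

alpha, x_alpha = choose_alpha(A, y_delta, delta)
print(alpha, x_alpha)
```

Since $F$ is strictly increasing, the first $\alpha$ with $F(\alpha) < 0$ lies within a factor $\gamma$ of the exact zero $\alpha(\delta)$, which is why the loop can stop there.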

In order to illustrate that Tikhonov regularization works, Table 5.4 gives some numerical results for the linear system of Example 5.1 with the erroneous right-hand side generated by using $\ln 2 \approx 0.69315$ and choosing the regularization parameter $\alpha = 10^{-10}$ (without attempting to use Theorem 5.10).



TABLE 5.4. Regularized solution of the linear system (5.1)

n   $\alpha_0$   $\alpha_1$   $\alpha_2$   $\alpha_3$   $\alpha_4$   $\alpha_5$   $\alpha_6$
1   […]          -0.4767
2   […]          […]           0.3285
3   […]          -0.9546       0.7021      -0.2491
4   […]          -1.0193       1.0154      -0.7605       0.2644
5   […]          -0.9659       0.7236      -0.1458      -0.2838       0.1735
6   […]          -0.9618       0.6564       0.0254      -0.2818      -0.1512       0.2166



Problems 



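5.1 For the condition number of linear operators show that
$$\operatorname{cond}(AB) \le \operatorname{cond}(A)\,\operatorname{cond}(B).$$

5.2 Let $A$ be an $n \times n$ matrix and $Q$ be a unitary $n \times n$ matrix. Show that
$$\operatorname{cond}_2(QA) = \operatorname{cond}_2(A)$$
and
$$\operatorname{cond}_2(A^*A) \ge \operatorname{cond}_2(A).$$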

5.3 Determine $\operatorname{cond}_2(A)$ for the matrix $A$ of Example 2.1 and discuss its behavior for large $n$.

5.4 Find the inverse of the matrix

A =

and find the condition numbers $\operatorname{cond}_p(A)$ for $p = 1, 2, \infty$.

5.5 Find the inverse of the matrix 



( 10 1 
1 10 
4 5 

V 0 -1 



4 0 

5 -1 



10 

7 



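and find the condition numbers $\operatorname{cond}_p(A)$ for $p = 1, 2, \infty$.

5.6 Calculate $\operatorname{cond}_\infty(A)$ for the matrix
$$A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 10 & 100 \\ 1 & 100 & 10000 \end{pmatrix}.$$
Show that one can improve the condition of a matrix by scaling through calculating $\operatorname{cond}_\infty(DA)$ where $D$ is the diagonal matrix
$$D = \operatorname{diag}(1/3,\ 1/111,\ 1/10101).$$

5.7 Let $A = (a_{jk})$ be an $n \times n$ matrix satisfying
$$\sum_{k=1}^{n} |a_{jk}| = 1, \quad j = 1, \dots, n.$$
Show that
$$\operatorname{cond}_\infty(A) \le \operatorname{cond}_\infty(DA)$$
for all $n \times n$ diagonal matrices $D$ (see Problem 5.6).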



5.8 For a nonsingular matrix A show that 

= m min{||B|1 ■ A+Bis sin s uiar )- 

This indicates that if a nonsingular matrix has a large condition number, it is 
close to a singular matrix. 

5.9 Show that for an m x n matrix A the n x n matrix A* A is Hermitian and 
positive semidefinite. 



5.10 Find the singular value decomposition of 



A = 



10 1 1 \ 

10—101 
11 0 1 / 







5.11 Show that $A^\dagger y$ is the least squares solution of $Ax = y$ with minimal norm.

5.12 Show that the pseudo-inverse $A^\dagger$ is uniquely determined by the properties
$$AA^\dagger = (AA^\dagger)^*, \quad A^\dagger A = (A^\dagger A)^*, \quad AA^\dagger A = A, \quad A^\dagger AA^\dagger = A^\dagger.$$
Express the pseudo-inverse in terms of the decomposition (5.8).

5.13 For the pseudo-inverse show that $(A^\dagger)^\dagger = A$ and $(A^*)^\dagger = (A^\dagger)^*$.

5.14 Give an example to show that in general, $(AB)^\dagger \neq B^\dagger A^\dagger$.

5.15 What is the pseudo-inverse of $A : \mathbb{C}^n \to \mathbb{C}^m$ given by $Ax = (x, a)\,b$ with $a \in \mathbb{C}^n$ and $b \in \mathbb{C}^m$?

5.16 For an m x n matrix show that 

||(aJ + A*A)-M*|| 2 <-£= 

v<* 



for a > 0. 

5.17 Give an alternative proof of Theorem 5.9 by using the necessary and suf- 
ficient conditions for the minimum of a function of n variables. 

5.18 Let $X$ and $Y$ be finite-dimensional pre-Hilbert spaces and let $A : X \to Y$ be a linear operator. Show that there exists a uniquely determined linear operator $A^* : Y \to X$ with the property
$$(Ax, y)_Y = (x, A^*y)_X$$
for all $x \in X$ and $y \in Y$. Use this result to formulate and prove a generalization of Theorem 5.9 for the minimization of
$$\|Ax - y\|_Y^2 + \alpha\|x\|_X^2.$$



5.19 Show that 



( x > y) - ~ yj- 1 ) 

3 = 0 j = 1 

defines a scalar product on <D n . Discuss its use in Tikhonov regularization as 
indicated in Problem 5.18, where in addition to large components of the solution 
vector oscillations between consecutive components are also penalized. 

5.20 Show that $A : C[0, 1] \to C[0, 1]$ defined by
$$(Af)(x) := \int_0^x f(y)\,dy, \quad x \in [0, 1],$$
is a bounded linear operator that does not have a bounded inverse; i.e., show that differentiation is an ill-posed problem.




6 

Iterative Methods for 
Nonlinear Systems 



In this chapter we will study the solution of systems of nonlinear equa- 
tions. As opposed to linear equations, no explicit solution techniques are, 
in general, available for nonlinear equations, and hence their solution com- 
pletely relies on iterative methods. In the first section we shall begin with 
the application of the Banach fixed point theorem for systems of nonlin- 
ear equations with one or several variables. Given the fact that iterative 
techniques have a long history in mathematics, the significance of Banach’s 
fixed point theorem originates from its unified approach, covering a wide 
variety of different successive approximation methods. 

In the second section, we will continue with the study of Newton’s it- 
eration method for finding zeros of functions of one or several variables. 
This iteration scheme is attributed to Newton, since in 1669 he developed 
a solution method for cubic equations by linearization that may be viewed 
as a precursor of what is now known as Newton iteration. He also used this 
method for approximately solving Kepler’s equations for planetary motion. 

In the concluding two sections of this chapter we will consider the appli- 
cation of Newton’s method for finding zeros of polynomials and its modifi- 
cation into the more recently developed Levenberg-Marquardt scheme for 
solving the least squares problem. 

Given the vast number of iterative methods available for nonlinear equations, we will confine our presentation to describing the fundamental ideas and will not aim at a complete treatment of the subject.

6.1 Successive Approximations

In this section, we will consider systems of $n$ nonlinear equations for $n$ unknowns of the form
$$f(x) = x,$$
where $x = (x_1, \dots, x_n)^T$ and $f(x) = (f_1(x_1, \dots, x_n), \dots, f_n(x_1, \dots, x_n))^T$. We begin by studying the case of a single nonlinear equation with one unknown. Obviously, in one dimension, solving $f(x) = x$ geometrically corresponds to determining the intersection of the graph of the function $f$ with the straight line described by the function $x \mapsto x$.

Theorem 6.1 Let $D \subset \mathbb{R}$ be a closed interval and let $f : D \to D$ be a continuously differentiable function with the property
$$q := \sup_{x \in D} |f'(x)| < 1.$$
Then the equation $f(x) = x$ has a unique solution $x \in D$, and the successive approximations
$$x_{\nu+1} := f(x_\nu), \quad \nu = 0, 1, 2, \dots,$$
with arbitrary $x_0 \in D$ converge to this solution. We have the a priori error estimate
$$|x_\nu - x| \le \frac{q^\nu}{1 - q}\,|x_1 - x_0|$$
and the a posteriori error estimate
$$|x_\nu - x| \le \frac{q}{1 - q}\,|x_\nu - x_{\nu-1}|$$
for all $\nu \in \mathbb{N}$.

Proof. Equipped with the norm $\|\cdot\| = |\cdot|$ the space $\mathbb{R}$ is complete. By the mean value theorem, for $x, y \in D$ with $x < y$, we have that
$$f(x) - f(y) = f'(\xi)(x - y)$$
for some intermediate point $\xi \in (x, y)$. Hence
$$|f(x) - f(y)| \le \sup_{\xi \in D} |f'(\xi)|\,|x - y| = q\,|x - y|,$$
which is also valid for $x, y \in D$ with $x \ge y$. Therefore, $f$ is a contraction, and the assertion follows from the Banach fixed point Theorem 3.46. □

Figure 6.1 illustrates graphically the successive approximations for functions $f$ with positive and negative slope, respectively, of absolute value less than one. Note that the sequence $(x_\nu)$ converges to the fixed point monotonically if $f$ has positive slope and that it converges with values alternating above and below the fixed point if $f$ has negative slope. In both cases the slope of the function $f$ has absolute value less than one in a neighborhood of the fixed point. From drawing a corresponding figure for a function with a slope of absolute value greater than one it can be seen that the corresponding iteration will move away from the fixed point (see Problem 6.2).





FIGURE 6.1. Fixed point iteration 

The following theorem states that for a fixed point $x$ with $|f'(x)| < 1$ we always can find starting points $x_0$ ensuring convergence of the successive approximations.

Theorem 6.2 Let $x$ be a fixed point of a continuously differentiable function $f$ such that $|f'(x)| < 1$. Then the method of successive approximations $x_{\nu+1} := f(x_\nu)$ is locally convergent; i.e., there exists a neighborhood $B$ of the fixed point $x$ such that the successive approximations converge to $x$ for all $x_0 \in B$.

Proof. Since $f'$ is continuous and $|f'(x)| < 1$, there exist constants $0 < q < 1$ and $\delta > 0$ such that $|f'(y)| \le q$ for all $y \in B := [x - \delta, x + \delta]$. Then we have that
$$|f(y) - x| = |f(y) - f(x)| \le q\,|y - x| \le |y - x| \le \delta$$
for all $y \in B$; i.e., $f$ maps $B$ into itself and is a contraction $f : B \to B$. Now the statement of the theorem follows from Theorem 6.1. □

Theorem 6.2 expresses the fact that for a fixed point $x$ with $|f'(x)| < 1$ the sequence $x_{\nu+1} := f(x_\nu)$ converges if the starting point $x_0$ is sufficiently close to $x$. In practical situations the problem of how to obtain such a good initial guess is unresolved in general. Frequently, however, a good estimate of the fixed point might be known a priori from the underlying application or might be deduced from analytic observations.

The following examples illustrate that in some cases we also have global convergence, where the successive approximations converge for each starting point in the domain of definition of the function $f$.

Example 6.3 In order to describe a division by iteration, for $a > 0$ we consider the function $f : \mathbb{R} \to \mathbb{R}$ given by $f(x) := 2x - ax^2$. The graph of this function is a parabola with maximum value $1/a$ attained at $1/a$. By solving the quadratic equation $f(x) = x$ it can be seen that $f$ has the fixed points $x = 0$ and $x = 1/a$. Obviously, $f$ maps the open interval $(0, 2/a)$ into $(0, 1/a]$. Since $f'(x) = 2(1 - ax)$, we have $f'(0) = 2$ and $f'(1/a) = 0$. From the property $x < f(x) < 1/a$, which is valid for $0 < x < 1/a$, it follows that the sequence $x_{\nu+1} := 2x_\nu - ax_\nu^2$ is monotonically increasing and bounded. Hence, the successive approximations converge to the fixed point $x = 1/a$ for arbitrarily chosen $x_0 \in (0, 2/a)$. Figure 6.2 illustrates the convergence. The numerical results are for $a = 2$ and two different starting points, $x_0 = 0.3$ and $x_0 = 0.4$. □




$\nu$   $x_\nu$       $x_\nu$
0   0.30000000   0.40000000
1   0.42000000   0.48000000
2   0.48720000   0.49920000
3   0.49967232   0.49999872

FIGURE 6.2. Division by iteration
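The numerical values of Example 6.3 can be recomputed with a short Python sketch. The function name `reciprocal_by_iteration` is a hypothetical choice for illustration; the update itself is the iteration $x_{\nu+1} := 2x_\nu - ax_\nu^2$ of the example.

```python
def reciprocal_by_iteration(a, x0, steps):
    """Iterate x_{v+1} = 2*x_v - a*x_v**2; the update uses no division."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(2.0 * x - a * x * x)
    return xs

# a = 2 with the two starting points of Example 6.3
for x0 in (0.3, 0.4):
    print([f"{x:.8f}" for x in reciprocal_by_iteration(2.0, x0, 3)])
```

Both runs approach $1/a = 0.5$, and the error is roughly squared in every step, in line with $f'(1/a) = 0$.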



Example 6.4 For computing the square root of a positive real number $a$ by an iterative method we consider the function $f : (0, \infty) \to (0, \infty)$ given by
$$f(x) := \frac{1}{2}\left(x + \frac{a}{x}\right).$$
By solving the quadratic equation $f(x) = x$ it can be seen that $f$ has the fixed point $x = \sqrt{a}$. By the arithmetic-geometric mean inequality we have that $f(x) \ge \sqrt{a}$ for $x > 0$; i.e., $f$ maps the open interval $(0, \infty)$ into $[\sqrt{a}, \infty)$, and therefore it maps the closed interval $[\sqrt{a}, \infty)$ into itself. From
$$f'(x) = \frac{1}{2}\left(1 - \frac{a}{x^2}\right)$$
it follows that
$$q := \sup_{\sqrt{a} \le x < \infty} |f'(x)| = \frac{1}{2}.$$
Hence $f : [\sqrt{a}, \infty) \to [\sqrt{a}, \infty)$ is a contraction. Therefore, by Theorem 6.1 the successive approximations
$$x_{\nu+1} := \frac{1}{2}\left(x_\nu + \frac{a}{x_\nu}\right), \quad \nu = 0, 1, \dots,$$
converge to the square root $\sqrt{a}$ for each $x_0 > 0$, and we have the a posteriori error estimate
$$|\sqrt{a} - x_\nu| \le |x_\nu - x_{\nu-1}|.$$
Figure 6.3 illustrates the convergence. The numerical results again are for $a = 2$. □



$\nu$   $x_\nu$
0   5.00000000
1   2.70000000
2   1.72037037
3   1.44145537
4   1.41447098
5   1.41421359
6   1.41421356

FIGURE 6.3. Square root by iteration
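A minimal Python sketch of Example 6.4, stopped by the a posteriori estimate $|\sqrt{a} - x_\nu| \le |x_\nu - x_{\nu-1}|$, might look as follows. The function name `sqrt_by_iteration` and the tolerance `tol` are assumptions made for this illustration.

```python
import math

def sqrt_by_iteration(a, x0, tol=1e-8, max_steps=100):
    """Iterate x_{v+1} = (x_v + a/x_v)/2 until |x_v - x_{v-1}| <= tol.

    By the a posteriori estimate of Example 6.4 (q = 1/2), the returned
    value then differs from sqrt(a) by at most tol.
    """
    x_prev = x0
    for steps in range(1, max_steps + 1):
        x = 0.5 * (x_prev + a / x_prev)
        if abs(x - x_prev) <= tol:
            return x, steps
        x_prev = x
    raise RuntimeError("tolerance not reached")

root, steps = sqrt_by_iteration(2.0, 5.0)
print(root, steps, abs(root - math.sqrt(2.0)))
```

The stopping test needs no knowledge of the exact root, which is the practical value of an a posteriori bound.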

In both of Examples 6.3 and 6.4 the numerical values exhibit a very rapid convergence. This is due to the fact that because of $f'(x) = 0$ at the fixed point, the contraction number is very small close to the fixed point. We shall elaborate on this observation later when we consider Newton’s method.

TABLE 6.1. Iterations for Example 6.5

$\nu$   $x_\nu$         $\nu$   $x_\nu$
0   1.00000000      7   0.72210243
1   0.54030231      ...
2   0.85755322      ...
3   0.65428979     45   0.73908513
4   0.79348036     46   0.73908514
5   0.70136877     47   0.73908513
6   0.76395968     48   0.73908513

Example 6.5 Consider the function $f : [0, 1] \to [0, 1]$ given by
$$f(x) := \cos x.$$
Here we have
$$q = \sup_{0 \le x \le 1} |f'(x)| = \sin 1 < 1,$$
and Theorem 6.1 implies that the successive approximations $x_{\nu+1} := \cos x_\nu$ converge to the unique solution $x$ of $\cos x = x$ for each $x_0 \in [0, 1]$. Table 6.1 illustrates the convergence, which is notably slower than in the two previous examples. □
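To see how the two error estimates of Theorem 6.1 behave for Example 6.5, the following Python sketch stops the iteration by the a posteriori bound and compares the step count with the one predicted by the a priori bound. The helper name `fixed_point` and the tolerance $10^{-8}$ are assumptions for illustration only.

```python
import math

def fixed_point(f, x0, q, tol, max_steps=1000):
    """Successive approximations x_{v+1} = f(x_v), stopped as soon as the
    a posteriori bound q/(1-q) * |x_v - x_{v-1}| drops below tol."""
    x_prev = x0
    for steps in range(1, max_steps + 1):
        x = f(x_prev)
        if q / (1.0 - q) * abs(x - x_prev) <= tol:
            return x, steps
        x_prev = x
    raise RuntimeError("tolerance not reached")

q = math.sin(1.0)          # contraction number of cos on [0, 1]
tol = 1e-8
x, steps = fixed_point(math.cos, 1.0, q, tol)

# Smallest v with q**v / (1 - q) * |x_1 - x_0| <= tol (a priori bound).
first_step = abs(math.cos(1.0) - 1.0)
a_priori_steps = math.ceil(math.log(tol * (1.0 - q) / first_step) / math.log(q))

print(x, steps, a_priori_steps)
```

The a posteriori test can never stop later than the a priori prediction, and in this example it typically stops much earlier, because the slope of $\cos$ near the fixed point is smaller in absolute value than the global bound $q = \sin 1$.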

By the following example we illustrate how to obtain a fixed point of a function with derivative of absolute value greater than one by working with the inverse function.

Example 6.6 The function $h : (0, \infty) \to (-\infty, \infty)$ given by $h(x) := x + \ln x$ is strictly monotonically increasing with limits $\lim_{x \to 0} h(x) = -\infty$ and $\lim_{x \to \infty} h(x) = \infty$. Therefore, the function $f(x) := -\ln x$ has a unique fixed point $x$. Since this fixed point must satisfy $0 < x < 1$, the derivative
$$|f'(x)| = \frac{1}{x} > 1$$
implies that $f$ is not contracting in a neighborhood of the fixed point. However, we can still design a convergent scheme because $x = -\ln x$ is equivalent to $e^{-x} = x$. We consider the inverse function
$$g(x) := e^{-x}$$
of $f$, which has derivative $|g'(x)| = e^{-x} < 1$ at the fixed point, so that we can apply Theorem 6.2. Obviously, for each $0 < a < 1/e$ the exponential function $g$ maps the interval $[a, 1]$ into itself. Since
$$q = \sup_{a \le x \le 1} |g'(x)| = e^{-a} < 1,$$
by Theorem 6.1 it follows that for arbitrary $x_0 > 0$ the successive approximations $x_{\nu+1} = e^{-x_\nu}$ converge to the unique solution of $x = e^{-x}$. □

Now we will extend Theorem 6.1 to systems of nonlinear equations. A subset $D$ of a linear space $X$ is called convex if
$$\lambda x + (1 - \lambda)y \in D$$
for all $x, y \in D$ and all $\lambda \in (0, 1)$, i.e., if the straight line segment connecting $x$ and $y$ is contained in $D$.

Theorem 6.7 Let $D \subset \mathbb{R}^n$ be open and convex and let $f : D \to \mathbb{R}^n$ be a mapping
$$f(x) = (f_1(x_1, \dots, x_n), \dots, f_n(x_1, \dots, x_n))^T,$$
where the $f_j : D \to \mathbb{R}$, $j = 1, \dots, n$, are continuously differentiable functions. By
$$f'(x) = \left(\frac{\partial f_j}{\partial x_k}(x)\right)_{j,k = 1, \dots, n}$$
we denote the Jacobian matrix of $f$. Then we have the mean value theorem
$$\|f(x) - f(y)\| \le \max_{0 \le \lambda \le 1} \|f'[\lambda x + (1 - \lambda)y]\|\,\|x - y\|$$
for all $x, y \in D$ (and all norms $\|\cdot\|$ on $\mathbb{R}^n$).

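Proof. Let $g : [0, 1] \to \mathbb{R}^n$ be continuous. We will show that
$$\left\|\int_0^1 g(\lambda)\,d\lambda\right\| \le \int_0^1 \|g(\lambda)\|\,d\lambda, \tag{6.1}$$
where the integral on the left-hand side has to be understood as the vector of the integrals over the components of $g$. The function $\lambda \mapsto \|g(\lambda)\|$ is continuous, since the norm is a continuous function. Therefore, the integral on the right-hand side of (6.1) is well-defined. Consider the equidistant subdivision $\lambda_i = i/m$, $i = 0, 1, \dots, m$, for $m \in \mathbb{N}$. Then we have the converging Riemann sums
$$\sum_{i=1}^m \|g(\lambda_i)\|\,(\lambda_i - \lambda_{i-1}) \to \int_0^1 \|g(\lambda)\|\,d\lambda, \quad m \to \infty,$$
and
$$\sum_{i=1}^m g(\lambda_i)\,(\lambda_i - \lambda_{i-1}) \to \int_0^1 g(\lambda)\,d\lambda, \quad m \to \infty.$$
From the second limit, by the continuity of the norm we conclude that
$$\left\|\sum_{i=1}^m g(\lambda_i)\,(\lambda_i - \lambda_{i-1})\right\| \to \left\|\int_0^1 g(\lambda)\,d\lambda\right\|, \quad m \to \infty.$$
Now (6.1) follows by passing to the limit $m \to \infty$ in the inequality
$$\left\|\sum_{i=1}^m g(\lambda_i)\,(\lambda_i - \lambda_{i-1})\right\| \le \sum_{i=1}^m \|g(\lambda_i)\|\,(\lambda_i - \lambda_{i-1}),$$
which is a consequence of the triangle inequality.

Since $D$ is convex, for all $x, y \in D$ we have that
$$f_j(x) - f_j(y) = \int_0^1 \frac{d}{d\lambda} f_j[\lambda x + (1 - \lambda)y]\,d\lambda, \quad j = 1, \dots, n.$$
By the chain rule we compute
$$\frac{d}{d\lambda} f_j[\lambda x + (1 - \lambda)y] = \sum_{k=1}^n \frac{\partial f_j}{\partial x_k}[\lambda x + (1 - \lambda)y]\,(x_k - y_k),$$
and therefore
$$f_j(x) - f_j(y) = \int_0^1 \sum_{k=1}^n \frac{\partial f_j}{\partial x_k}[\lambda x + (1 - \lambda)y]\,(x_k - y_k)\,d\lambda;$$
i.e., in vector form,
$$f(x) - f(y) = \int_0^1 f'[\lambda x + (1 - \lambda)y]\,(x - y)\,d\lambda.$$
From this, with the aid of (6.1) and the continuity of $\lambda \mapsto f'[\lambda x + (1 - \lambda)y]$, we obtain
$$\|f(x) - f(y)\| \le \int_0^1 \|f'[\lambda x + (1 - \lambda)y]\|\,\|x - y\|\,d\lambda \le \max_{0 \le \lambda \le 1} \|f'[\lambda x + (1 - \lambda)y]\|\,\|x - y\|,$$
which ends the proof. □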

Theorem 6.8 Let $D \subset \mathbb{R}^n$ be closed and convex (with a nonempty interior) and let $f : D \to D$ be a continuous mapping. Assume further that $f$ is continuously differentiable in the interior of $D$ and that its Jacobian can be continuously extended to all of $D$ such that
$$q := \sup_{x \in D} \|f'(x)\| < 1$$
in some norm $\|\cdot\|$ on $\mathbb{R}^n$. Then the equation $f(x) = x$ has a unique solution $x \in D$, and the successive approximations
$$x_{\nu+1} := f(x_\nu), \quad \nu = 0, 1, 2, \dots,$$
converge for each $x_0 \in D$ to this fixed point. We have the a priori error estimate
$$\|x_\nu - x\| \le \frac{q^\nu}{1 - q}\,\|x_1 - x_0\|$$
and the a posteriori error estimate
$$\|x_\nu - x\| \le \frac{q}{1 - q}\,\|x_\nu - x_{\nu-1}\|$$
for all $\nu \in \mathbb{N}$.

Proof. By the mean value Theorem 6.7 the mapping $f : D \to D$ is a contraction, and the assertion follows from the Banach fixed point theorem. □







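By Theorem 3.26 we have that each of the conditions
$$\sup_{x \in D}\ \max_{j = 1, \dots, n}\ \sum_{k=1}^n \left|\frac{\partial f_j}{\partial x_k}(x)\right| < 1,$$
$$\sup_{x \in D}\ \max_{k = 1, \dots, n}\ \sum_{j=1}^n \left|\frac{\partial f_j}{\partial x_k}(x)\right| < 1,$$
$$\sup_{x \in D}\ \left[\sum_{j,k=1}^n \left|\frac{\partial f_j}{\partial x_k}(x)\right|^2\right]^{1/2} < 1$$
ensures convergence of the successive approximations in Theorem 6.8.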

The following local convergence theorem can be proven analogously to 
Theorem 6.2. 



Theorem 6.9 Let $x$ be a fixed point of a continuously differentiable function $f$ such that $\|f'(x)\| < 1$ in some norm $\|\cdot\|$ on $\mathbb{R}^n$. Then the method of successive approximations $x_{\nu+1} := f(x_\nu)$ is locally convergent; i.e., there exists a neighborhood $B$ of the fixed point $x$ such that the successive approximations converge to $x$ for all starting elements $x_0 \in B$.

Example 6.10 For the system
$$x_1 = 0.5\cos x_1 - 0.5\sin x_2,$$
$$x_2 = 0.5\sin x_1 + 0.5\cos x_2$$
we have
$$f'(x) = \begin{pmatrix} -0.5\sin x_1 & -0.5\cos x_2 \\ 0.5\cos x_1 & -0.5\sin x_2 \end{pmatrix},$$
and therefore $\|f'(x)\|_2 \le \sqrt{0.5}$ for all $x \in \mathbb{R}^2$. Hence Theorem 6.8 is applicable. □
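As an illustration of Theorem 6.8 applied to Example 6.10, the following Python sketch runs the successive approximations with NumPy and stops by the a posteriori estimate with $q = \sqrt{0.5}$. The starting point $x_0 = 0$, the tolerance, and the variable names are assumptions made here, not part of the example.

```python
import numpy as np

def f(x):
    """Right-hand side of the system in Example 6.10."""
    return np.array([0.5 * np.cos(x[0]) - 0.5 * np.sin(x[1]),
                     0.5 * np.sin(x[0]) + 0.5 * np.cos(x[1])])

q = np.sqrt(0.5)      # bound for the spectral norm of the Jacobian
tol = 1e-10
x = np.zeros(2)       # arbitrary starting point, here chosen as the origin

for steps in range(1, 201):
    x_new = f(x)
    # a posteriori estimate: ||x_v - x*|| <= q/(1-q) * ||x_v - x_{v-1}||
    bound = q / (1.0 - q) * np.linalg.norm(x_new - x)
    x = x_new
    if bound <= tol:
        break

print(steps, x, np.linalg.norm(f(x) - x))
```

Because the contraction holds on all of $\mathbb{R}^2$, any other starting point would lead to the same fixed point.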

The reader will not be surprised to learn that for speeding up convergence 
of the successive approximations, concepts developed for linear equations 
like relaxation methods or multigrid methods can also be successfully em- 
ployed in the nonlinear case. However, since we discussed these methods 
in some detail in Sections 4.2 and 4.3 for linear equations, we shall refrain 
from repeating the analysis for nonlinear equations. 



6.2 Newton’s Method 

We now want to determine zeros of a function of $n$ variables; i.e., we want to solve equations of the form
$$f(x) = 0,$$
where $f : D \to \mathbb{R}^n$ is a continuously differentiable function defined on some open subset $D \subset \mathbb{R}^n$.

We begin by considering a function of one variable. Let $x_0$ be an approximation to a zero of the function $f$. In a neighborhood of $x_0$, by Taylor’s formula we have that
$$f(x) \approx f(x_0) + f'(x_0)(x - x_0) =: g(x). \tag{6.2}$$
Therefore, we may consider the zero of the affine linear function $g$ as a new approximation to the zero of $f$ and denote it by $x_1$. From the linear equation
$$f(x_0) + f'(x_0)(x_1 - x_0) = 0 \tag{6.3}$$
we immediately obtain
$$x_1 = x_0 - \frac{f(x_0)}{f'(x_0)}.$$
Geometrically, the affine linear function $g$ describes the tangent line to the graph of the function $f$ at the point $x_0$.

This consideration can be extended to the case of more than one variable. Given an approximation $x_0$ to a zero of $f$, by Taylor’s formula we still have the approximation (6.2), where now, as in the previous section,
$$f'(x) = \left(\frac{\partial f_j}{\partial x_k}(x)\right)_{j,k = 1, \dots, n}$$
denotes the Jacobian matrix of $f$. Again we obtain a new approximation $x_1$ for the solution of $f(x) = 0$ by solving the linearized equation (6.3), i.e., by
$$x_1 = x_0 - [f'(x_0)]^{-1} f(x_0).$$
Geometrically, the function $g$ of (6.2) corresponds to the hyperplane tangent to $f$ at the point $x_0$.

Iterating this procedure leads to Newton’s method, as described in the 
following definition. In the case of one variable, the geometric situation is 
shown in Figure 6.4. 

Definition 6.11 Let $D \subset \mathbb{R}^n$ be open and let $f : D \to \mathbb{R}^n$ be a continuously differentiable function such that the Jacobian matrix $f'(x)$ is nonsingular for all $x \in D$. Then Newton’s method for the solution of the equation
$$f(x) = 0$$
is given by the iteration scheme
$$x_{\nu+1} := x_\nu - [f'(x_\nu)]^{-1} f(x_\nu), \quad \nu = 0, 1, \dots,$$
starting with some $x_0 \in D$.

We explicitly note that $x_{\nu+1}$ is obtained by solving the system of linear equations
$$f'(x_\nu)(x_\nu - x_{\nu+1}) = f(x_\nu)$$
for $x_\nu - x_{\nu+1}$; i.e., no matrix inversion is required.
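The remark that each Newton step amounts to one linear solve can be sketched in Python with NumPy as follows. The function name `newton_system`, the stopping rule on the step length, and the test system $x_1^2 + x_2^2 = 1$, $x_1 = x_2$ are hypothetical choices for illustration.

```python
import numpy as np

def newton_system(f, jac, x0, tol=1e-12, max_steps=50):
    """Newton's method: solve f'(x_v) (x_v - x_{v+1}) = f(x_v) in each step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_steps):
        step = np.linalg.solve(jac(x), f(x))   # step = x_v - x_{v+1}
        x = x - step
        if np.linalg.norm(step) <= tol:
            return x
    raise RuntimeError("Newton iteration did not converge")

# Hypothetical test system: intersection of the unit circle with the line x1 = x2.
def f(x):
    return np.array([x[0] ** 2 + x[1] ** 2 - 1.0, x[0] - x[1]])

def jac(x):
    return np.array([[2.0 * x[0], 2.0 * x[1]],
                     [1.0, -1.0]])

zero = newton_system(f, jac, [1.0, 0.5])
print(zero)
```

Calling `np.linalg.solve` rather than forming the inverse of the Jacobian mirrors the statement above that no matrix inversion is required.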

Example 6.12 For the function
$$f(x) := a - \frac{1}{x},$$
where $a > 0$, the Newton iteration is given by
$$x_{\nu+1} := 2x_\nu - ax_\nu^2.$$
By Example 6.3 we have convergence for all $x_0 \in (0, 2/a)$. □

Example 6.13 For the function
$$f(x) := x^2 - a,$$
where $a > 0$, the Newton iteration is given by
$$x_{\nu+1} := \frac{1}{2}\left(x_\nu + \frac{a}{x_\nu}\right).$$
By Example 6.4 we have convergence for all $x_0 \in (0, \infty)$. □

Of course, we cannot expect that Newton’s method will always converge. However, by the following analysis we can assure local convergence.

Theorem 6.14 Let $D \subset \mathbb{R}^n$ be open and convex and let $f : D \to \mathbb{R}^n$ be continuously differentiable. Assume that for some norm $\|\cdot\|$ on $\mathbb{R}^n$ and some $x_0 \in D$ the following conditions hold:

(a) $f'$ satisfies
$$\|f'(x) - f'(y)\| \le \gamma\,\|x - y\|$$
for all $x, y \in D$ and some constant $\gamma > 0$.

(b) The Jacobian matrix $f'(x)$ is nonsingular for all $x \in D$, and there exists a constant $\beta > 0$ such that
$$\|[f'(x)]^{-1}\| \le \beta, \quad x \in D.$$

(c) For the constants
$$\alpha := \|[f'(x_0)]^{-1} f(x_0)\| \quad\text{and}\quad q := \alpha\beta\gamma$$
the inequality
$$q < \frac{1}{2}$$
is satisfied.

(d) For $r := 2\alpha$ the closed ball $B[x_0, r] := \{x : \|x - x_0\| \le r\}$ is contained in $D$.

Then $f$ has a unique zero $x^*$ in $B[x_0, r]$. Starting with $x_0$ the Newton iteration
$$x_{\nu+1} := x_\nu - [f'(x_\nu)]^{-1} f(x_\nu), \quad \nu = 0, 1, \dots, \tag{6.4}$$
is well-defined. The sequence $(x_\nu)$ converges to the zero $x^*$ of $f$, and we have the error estimate
$$\|x_\nu - x^*\| \le 2\alpha q^{2^\nu - 1}, \quad \nu = 0, 1, \dots.$$

Proof. 1. Let $x, y, z \in D$. From the proof of Theorem 6.7 we know that
$$f(y) - f(x) = \int_0^1 f'[\lambda x + (1 - \lambda)y]\,(y - x)\,d\lambda.$$
Hence
$$f(y) - f(x) - f'(z)(y - x) = \int_0^1 \{f'[\lambda x + (1 - \lambda)y] - f'(z)\}\,(y - x)\,d\lambda,$$
and estimating with the aid of (6.1) and condition (a) we find that
$$\|f(y) - f(x) - f'(z)(y - x)\| \le \gamma\,\|y - x\| \int_0^1 \|\lambda(x - z) + (1 - \lambda)(y - z)\|\,d\lambda \le \frac{\gamma}{2}\,\|y - x\|\,\{\|x - z\| + \|y - z\|\}.$$
Choosing $z = x$ shows that
$$\|f(y) - f(x) - f'(x)(y - x)\| \le \frac{\gamma}{2}\,\|y - x\|^2 \tag{6.5}$$
for all $x, y \in D$, and choosing $z = x_0$ yields
$$\|f(y) - f(x) - f'(x_0)(y - x)\| \le \gamma r\,\|y - x\| \tag{6.6}$$
for all $x, y \in B[x_0, r]$.

2. We proceed by proving through induction that
$$\|x_\nu - x_0\| \le r \quad\text{and}\quad \|x_\nu - x_{\nu-1}\| \le \alpha q^{2^{\nu-1} - 1}, \quad \nu = 1, 2, \dots. \tag{6.7}$$
This is valid for $\nu = 1$, since
$$\|x_1 - x_0\| = \|[f'(x_0)]^{-1} f(x_0)\| = \alpha = \frac{r}{2} < r$$
as a consequence of conditions (c) and (d). Assume that the inequalities (6.7) are proven up to some $\nu \ge 1$. Then by condition (b) and since $x_\nu \in B[x_0, r] \subset D$, the element $x_{\nu+1}$ is well-defined. With the aid of condition (b), the definition (6.4) applied to $x_\nu$, the estimate (6.5), the induction assumption, and the definition of $q$ we can estimate
$$\|x_{\nu+1} - x_\nu\| = \|[f'(x_\nu)]^{-1} f(x_\nu)\| \le \beta\,\|f(x_\nu)\|$$
$$= \beta\,\|f(x_\nu) - f(x_{\nu-1}) - f'(x_{\nu-1})(x_\nu - x_{\nu-1})\|$$
$$\le \frac{\beta\gamma}{2}\,\|x_\nu - x_{\nu-1}\|^2 \le \frac{\beta\gamma}{2}\left[\alpha q^{2^{\nu-1} - 1}\right]^2 = \frac{\alpha}{2}\,q^{2^\nu - 1} < \alpha q^{2^\nu - 1}.$$
From this, with the help of the triangle inequality, the induction assumption, and condition (c), we obtain that
$$\|x_{\nu+1} - x_0\| \le \|x_{\nu+1} - x_\nu\| + \cdots + \|x_1 - x_0\| \le \alpha\left(1 + q + q^3 + q^7 + \cdots + q^{2^\nu - 1}\right) < 2\alpha = r;$$
i.e., the inequalities (6.7) also hold for $\nu + 1$.

3. For $\mu > 0$, using $q < 1/2$, we now can estimate
$$\|x_\nu - x_{\nu+\mu}\| \le \|x_\nu - x_{\nu+1}\| + \cdots + \|x_{\nu+\mu-1} - x_{\nu+\mu}\| \le \alpha q^{2^\nu - 1}\left(1 + q^{2^\nu} + \cdots + \left[q^{2^\nu}\right]^{2^{\mu-1} - 1}\right) < 2\alpha q^{2^\nu - 1}. \tag{6.8}$$
From this we observe that $(x_\nu)$ is a Cauchy sequence, since $q < 1/2$, and by Theorem 3.39 the limit
$$x^* = \lim_{\nu \to \infty} x_\nu$$
exists. Passing to the limit $\nu \to \infty$ in (6.7) we obtain $\|x^* - x_0\| \le r$, i.e., $x^* \in B[x_0, r]$, and passing to the limit $\mu \to \infty$ in (6.8) the error estimate of the theorem follows.

4. We now show that the limit $x^*$ is a zero of the function $f$. With the aid of (6.4) and condition (a) we can estimate
$$\|f(x_\nu)\| = \|f'(x_\nu)(x_{\nu+1} - x_\nu)\| \le \|f'(x_\nu) - f'(x_0) + f'(x_0)\|\,\|x_{\nu+1} - x_\nu\|$$
$$\le \left[\gamma\,\|x_\nu - x_0\| + \|f'(x_0)\|\right]\|x_{\nu+1} - x_\nu\| \to 0, \quad \nu \to \infty.$$
Hence $f(x_\nu) \to 0$, $\nu \to \infty$, and the continuity of $f$ implies that indeed
$$f(x^*) = 0.$$

5. We conclude the proof by showing that $x^*$ is the only zero of $f$ in the ball $B[x_0, r]$. For this we consider the function $g : B[x_0, r] \to \mathbb{R}^n$ defined by
$$g(x) := x - [f'(x_0)]^{-1} f(x).$$
From conditions (b) and (c) and the inequality (6.6), by writing
$$g(x) - g(y) = [f'(x_0)]^{-1}\{f(y) - f(x) - f'(x_0)(y - x)\}$$
we deduce that
$$\|g(x) - g(y)\| \le \beta\gamma r\,\|y - x\| = 2q\,\|y - x\|$$
for all $x, y \in B[x_0, r]$; i.e., $g$ is a contraction. Therefore, by Theorem 3.44 the function $g$ has at most one fixed point in $B[x_0, r]$. Now uniqueness of the zero of $f$ in $B[x_0, r]$ follows from the equivalence of the equations $g(x) = x$ and $f(x) = 0$. □

Our main application of Theorem 6.14 consists in deriving the following 
local convergence result for Newton’s method. 

Corollary 6.15 Let $D \subset \mathbb{R}^n$ be open and let $f : D \to \mathbb{R}^n$ be twice continuously differentiable, and assume that $x^*$ is a zero of $f$ such that the Jacobian $f'(x^*)$ is nonsingular. Then Newton’s method is locally convergent; i.e., there exists a neighborhood $B$ of the zero $x^*$ such that the Newton iterations converge to $x^*$ for all $x_0 \in B$.

Proof. Since $f$ is twice continuously differentiable, by the mean value Theorem 6.7 applied to the components of $f'$ there exists $\gamma > 0$ such that
$$\|f'(x) - f'(y)\| \le \gamma\,\|x - y\|$$
for all $x, y$ in some closed ball $B[x^*, \rho]$ centered at $x^*$. We write
$$f'(x) = f'(x^*)\{I + [f'(x^*)]^{-1}[f'(x) - f'(x^*)]\}$$
and deduce from the above estimate and Theorem 3.48 that the radius $\rho$ of $B[x^*, \rho]$ can be chosen such that $f'(x)$ is nonsingular on $B[x^*, \rho]$ and $\|[f'(x)]^{-1}\| \le \beta$ for $x \in B[x^*, \rho]$ and some constant $\beta > 0$.

Since $f$ is continuous, $f(x^*) = 0$ implies that there exists $\delta < \rho/2$ such that
$$\|f(x_0)\| < \min\left\{\frac{1}{2\beta^2\gamma},\ \frac{\rho}{4\beta}\right\}$$
for all $\|x_0 - x^*\| < \delta$. Then, after setting $\alpha := \|[f'(x_0)]^{-1} f(x_0)\|$ we have the inequalities
$$\alpha\beta\gamma \le \|f(x_0)\|\,\beta^2\gamma < \frac{1}{2}$$
and
$$2\alpha \le 2\beta\,\|f(x_0)\| < \frac{\rho}{2}.$$
Hence for the open and convex ball $B(x^*, \rho)$ and for each $x_0$ with $\|x_0 - x^*\| < \delta$ the assumptions of Theorem 6.14 are satisfied. □

Corollary 6.16 Let $f : (a, b) \to \mathbb{R}$ be twice continuously differentiable and assume that $x^*$ is a simple zero of $f$. Then Newton’s method is locally convergent.

Proof. For simple zeros we have $f'(x^*) \neq 0$. □

Example 6.17 For the function $f(x) := x - \cos x$ the Newton iteration reads
$$x_{\nu+1} := x_\nu - \frac{x_\nu - \cos x_\nu}{1 + \sin x_\nu}$$
and leads to the numerical values of Table 6.2. □



TABLE 6.2. Newton iterations for Example 6.17

$\nu$   $x_\nu$
0   1.00000000
1   0.75036387
2   0.73911289
3   0.73908513
4   0.73908513

Example 6.18 For the function $f(x) := x - e^{-x}$ the Newton iteration reads
$$x_{\nu+1} := x_\nu - \frac{x_\nu - e^{-x_\nu}}{1 + e^{-x_\nu}}$$
and leads to the numerical values of Table 6.3. □

TABLE 6.3. Newton iterations for Example 6.18

$\nu$   $x_\nu$
0   1.00000000
1   0.53788284
2   0.56698699
3   0.56714329
4   0.56714329

In both examples we observe that the speed of convergence is considerably improved as compared with the simple successive approximations of Examples 6.5 and 6.6. For a general description of this more rapid convergence of Newton’s method we need the following definition.

Definition 6.19 A convergent sequence $(x_\nu)$ from a normed space with limit $x$ is said to be convergent of order $p \ge 1$ if there exists a constant $C > 0$ such that
$$\|x_{\nu+1} - x\| \le C\,\|x_\nu - x\|^p, \quad \nu = 1, 2, \dots.$$
Convergence of order one or two is also called linear or quadratic convergence, respectively. We note that the convergence in Banach’s fixed point Theorem 3.45 is, in general, linear.
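The order of convergence can also be estimated numerically from three consecutive errors $e_\nu = \|x_\nu - x\|$: if $e_{\nu+1} \approx C e_\nu^p$, then $p \approx \ln(e_{\nu+1}/e_\nu) / \ln(e_\nu/e_{\nu-1})$. The following Python sketch applies this heuristic (it is not part of Definition 6.19) to the Newton iteration of Example 6.13 and to the successive approximations of Example 6.5; the function name `estimated_order`, the starting values, and the reference value for the fixed point of $\cos$ are assumptions for illustration.

```python
import math

def estimated_order(e_prev, e_curr, e_next):
    """Heuristic order estimate from three consecutive error norms."""
    return math.log(e_next / e_curr) / math.log(e_curr / e_prev)

# Newton iteration for x**2 - 2 (Example 6.13 with a = 2), starting at x_0 = 2.
xs = [2.0]
for _ in range(4):
    xs.append(0.5 * (xs[-1] + 2.0 / xs[-1]))
e = [abs(x - math.sqrt(2.0)) for x in xs]
p_newton = estimated_order(e[2], e[3], e[4])

# Successive approximations x_{v+1} = cos(x_v) (Example 6.5), starting at x_0 = 1.
fixed = 0.7390851332151607   # reference value for the solution of cos x = x
ys = [1.0]
for _ in range(22):
    ys.append(math.cos(ys[-1]))
d = [abs(y - fixed) for y in ys]
p_linear = estimated_order(d[20], d[21], d[22])

print(p_newton, p_linear)
```

The first estimate should come out close to two and the second close to one, which matches quadratic and linear convergence, respectively.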



Theorem 6.20 Under the assumptions of Theorem 6.14 Newton’s method converges quadratically.

Proof. Using condition (b) of Theorem 6.14 and the inequality (6.5) we can estimate
$$\|x^* - x_{\nu+1}\| = \|x^* - x_\nu + [f'(x_\nu)]^{-1} f(x_\nu)\|$$
$$\le \|[f'(x_\nu)]^{-1}\|\,\|f(x^*) - f(x_\nu) - f'(x_\nu)(x^* - x_\nu)\| \le \frac{\beta\gamma}{2}\,\|x^* - x_\nu\|^2,$$
since $f(x^*) = 0$. □

Roughly speaking, the quadratic convergence of Newton’s method means that the number of correct digits in the numerical approximation is doubled in each iteration step, as observed in Examples 6.3, 6.4, 6.17, and 6.18. Although by this property Newton’s method is very attractive, it has to be observed that one step of the Newton iteration for nonlinear systems can be very costly both through the need for evaluating the entries of the Jacobian $f'(x_\nu)$ and through the cost of solving the linear system to arrive at the new iterate $x_{\nu+1}$. Therefore, a great variety of modifications of Newton’s method have been developed that mitigate, in particular, the first difficulty. These modified Newton methods, in general, are of the form
$$x_{\nu+1} := x_\nu - A_\nu f(x_\nu), \quad \nu = 0, 1, \dots;$$
i.e., the inverse $[f'(x_\nu)]^{-1}$ of the Jacobian is replaced by some approximating matrix $A_\nu$. Here we will only briefly mention two classical and simple possibilities for avoiding the evaluation of the Jacobian at each iteration step.

In the simplified , or frozen , Newton method , for all steps the matrix A v 
is kept the same and chosen as the inverse of the Jacobian for the starting 
point; i.e., the iteration scheme is 

z„+i := x„ - [/'(zo)] -1 /^,,), v - 0 , 1, . . . . 

Geometrically, in the one-dimensional case this means that the tangent line 
of / at x v is replaced by the parallel to the tangent line of / at xo passing 
through (x u ,f(x v )). 

Theorem 6.21 Under the assumptions of Theorem 6.14 the simplified
Newton method converges linearly to the unique zero of $f$ in $B[x_0, r]$.

Proof. Recall that the function

$$g(x) := x - [f'(x_0)]^{-1} f(x)$$

defined in the proof of Theorem 6.14 is a contraction. We show that $g$ maps
$B[x_0, r]$ into itself. For this we write

$$x_0 - g(x) = [f'(x_0)]^{-1}\{f(x) - f(x_0) - f'(x_0)(x - x_0) + f(x_0)\}.$$

Then estimating with the help of conditions (b), (c), and (d) and the
inequality (6.5) we obtain

$$\|g(x) - x_0\| \le \frac{\beta\gamma}{2}\,\|x - x_0\|^2 + \alpha \le 2\alpha^2\beta\gamma + \alpha = (2q + 1)\alpha < 2\alpha = r$$

for all $x$ with $\|x - x_0\| \le r$. Now the statement of the theorem follows from
the Banach fixed point Theorem 3.46. □
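The difference between quadratic and linear convergence can be seen in a small experiment. The following is a minimal sketch, not taken from the text: it assumes the scalar function $f(x) = x - e^{-x}$, chosen only because its Newton iterates from $x_0 = 1$ agree with the values listed in Table 6.3; the function names are illustrative.

```python
import math

def f(x):
    # Assumed test function; its Newton iterates from x0 = 1 match Table 6.3.
    return x - math.exp(-x)

def df(x):
    return 1.0 + math.exp(-x)

def newton(x0, steps):
    """Full Newton method: the derivative is re-evaluated in every step."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - f(x) / df(x))
    return xs

def simplified_newton(x0, steps):
    """Simplified (frozen) Newton method: the derivative is fixed at x0."""
    d0 = df(x0)
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - f(x) / d0)
    return xs

full = newton(1.0, 4)
frozen = simplified_newton(1.0, 4)
for nu, (a, b) in enumerate(zip(full, frozen)):
    print(f"{nu}  {a:.8f}  {b:.8f}")
```

The first column reproduces the rapid settling of digits seen in Table 6.3, while the frozen iteration gains only a roughly constant factor per step.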





In the secant method for a function of one variable the derivative $f'(x_\nu)$
is approximated by the difference quotient, and the corresponding iterative
scheme is given by

$$x_{\nu+1} := x_\nu - f(x_\nu)\,\frac{x_\nu - x_{\nu-1}}{f(x_\nu) - f(x_{\nu-1})}, \quad \nu = 1, 2, \ldots. \tag{6.9}$$

Geometrically, this means that the tangent line at $x_\nu$ is replaced by the
secant line through the two points $x_\nu$ and $x_{\nu-1}$. Obviously, this method
needs two initial elements $x_0$ and $x_1$. Generalizations to functions in $\mathbb{R}^n$
are possible (see [47]).
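As an illustration of scheme (6.9), here is a short sketch for a function of one variable. The stopping rule, the tolerance, and the test function $f(x) = x - e^{-x}$ are assumptions made for the example and are not part of the text.

```python
import math

def secant(f, x0, x1, tol=1e-12, max_steps=50):
    """Secant iteration (6.9) started from the two initial elements x0, x1."""
    for _ in range(max_steps):
        f0, f1 = f(x0), f(x1)
        if f1 == f0:
            break  # difference quotient is undefined; stop with current x1
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        x0, x1 = x1, x2
        if abs(x1 - x0) < tol:
            break
    return x1

root = secant(lambda x: x - math.exp(-x), 1.0, 0.5)
print(root)
```

Each step needs only one new function value, since $f(x_{\nu-1})$ could be reused from the previous step; the sketch recomputes it for brevity.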

In general, for the simplified Newton method and for the secant method
we can expect only linear convergence. The idea underlying the more
sophisticated modified Newton methods is to choose the approximating
matrices $A_\nu$ in a manner leading to an improvement over linear convergence
without requiring the computational costs of the full Newton method. In
the so-called rank-one methods suggested by Broyden in 1965, in each
iteration step the matrix $A_\nu$ is updated from the previous matrix $A_{\nu-1}$ by
adding only a matrix of rank one such that the resulting iteration scheme is
superlinearly convergent. Roughly speaking, the latter means that for the
sequence $x_\nu \to x$, $\nu \to \infty$, we have that

$$\|x_{\nu+1} - x\| \le C_\nu\,\|x_\nu - x\|, \quad \nu = 1, 2, \ldots,$$

such that $C_\nu \to 0$, $\nu \to \infty$. For details we refer to the literature (see
[20, 47]).



6.3 Zeros of Polynomials 

In this section we shall apply Newton’s method to the computation of the 
zeros of polynomials. Finding the zeros of polynomials is a classical problem 
in mathematics and numerical analysis despite the fact that it very seldom 
occurs in applications. We first observe that Newton’s method also works 
for a complex function of a complex variable, allowing the computation of 
complex zeros. 

Consider the polynomial

$$p(x) = a_0 x^n + a_1 x^{n-1} + a_2 x^{n-2} + \cdots + a_{n-1} x + a_n$$

with real or complex coefficients $a_0, a_1, \ldots, a_n$. For the application of
Newton’s method, in each iteration step we need to compute the values of $p$
and $p'$ at the point $x_\nu$. This can be done efficiently by the Horner scheme,
which is based on writing the polynomial in the form of nested multiplications

$$p(z) = (\cdots((a_0 z + a_1)z + a_2)z + \cdots + a_{n-1})z + a_n,$$

which suggests the recursion

$$b_m := z\,b_{m-1} + a_m, \quad m = 1, \ldots, n, \tag{6.10}$$

starting with $b_0 := a_0$. Performing these $n$ multiplications and additions,
we arrive at the value of the polynomial $p(z) = b_n$.

For the polynomial

$$p_1(x) := b_0 x^{n-1} + b_1 x^{n-2} + b_2 x^{n-3} + \cdots + b_{n-2} x + b_{n-1},$$

using (6.10) we compute

$$p_1(x)(x - z) + b_n = \sum_{m=0}^{n-1} b_m x^{n-1-m}(x - z) + b_n = \sum_{m=0}^{n} a_m x^{n-m} = p(x).$$

This implies that for a zero $z$ the Horner scheme provides the coefficients
of the polynomial obtained by dividing $p$ by the linear factor $x - z$. In
addition, we have that

$$p'(x) = p_1'(x)(x - z) + p_1(x), \tag{6.11}$$

and in particular,

$$p'(z) = p_1(z).$$

Hence, applying the Horner recursion to the polynomial $p_1$ yields the value
of the derivative $p'(z)$. By repeating this process recursively, we can
determine all the derivatives of $p$ at the point $z$, since by induction, from
(6.11) we obtain that

$$p^{(k)}(x) = p_1^{(k)}(x)(x - z) + k\,p_1^{(k-1)}(x),$$

whence

$$p^{(k)}(z) = k\,p_1^{(k-1)}(z)$$

follows for $k = 1, \ldots, n$. Therefore, defining recursively polynomials $p_k$ of
degree $n - k$ by applying the Horner scheme to the preceding polynomial
$p_{k-1}$ leads to

$$p^{(k)}(z) = k!\,p_k(z), \quad k = 1, \ldots, n.$$

We can summarize this in the following theorem. 

Theorem 6.22 Let

$$p(x) = a_0 x^n + a_1 x^{n-1} + a_2 x^{n-2} + \cdots + a_{n-1} x + a_n$$

be a polynomial of degree $n$. For $z \in \mathbb{C}$ the complete Horner scheme
contains the derivatives

$$b^{(k)}_{n-k} = \frac{p^{(k)}(z)}{k!}, \quad k = 0, 1, \ldots, n,$$

of the polynomial $p$ at the point $z$. The scheme is recursively defined by
$b^{(-1)}_m := a_m$, $m = 0, \ldots, n$, and

$$b^{(k)}_0 := b^{(k-1)}_0, \qquad b^{(k)}_m := z\,b^{(k)}_{m-1} + b^{(k-1)}_m, \quad m = 1, \ldots, n - k,$$

for $k = 0, \ldots, n$.

Example 6.23 For the polynomial $p(x) := x^3 - x^2 + 3x - 5$ the Horner
scheme

           1   -1    3   -5
  z = 2    1    1    5    5
  z = 2    1    3   11
  z = 2    1    5
  z = 2    1

for $z = 2$ leads to $p(2) = 5$, $p'(2) = 11$, $p''(2) = 10$, $p'''(2) = 6$. □
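The complete Horner scheme translates almost directly into code. The sketch below is one possible arrangement (the function name `complete_horner` is an illustrative choice): each pass applies the recursion (6.10) to the previous row, records the last entry, and drops it before the next pass.

```python
from math import factorial

def complete_horner(coeffs, z):
    """Complete Horner scheme for p with coefficients [a_0, ..., a_n].

    Returns [p(z), p'(z)/1!, p''(z)/2!, ..., p^(n)(z)/n!].
    """
    row = list(coeffs)
    taylor = []
    while row:
        b = [row[0]]
        for a in row[1:]:
            b.append(z * b[-1] + a)   # recursion (6.10) applied to this row
        taylor.append(b[-1])          # last entry equals p^(k)(z) / k!
        row = b[:-1]                  # coefficients of the next polynomial
    return taylor

# Example 6.23: p(x) = x^3 - x^2 + 3x - 5 at z = 2
scaled = complete_horner([1, -1, 3, -5], 2)
derivs = [factorial(k) * c for k, c in enumerate(scaled)]
print(derivs)  # [5, 11, 10, 6]
```

The rows computed inside the loop are exactly the rows of the scheme in Example 6.23.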

We continue by outlining how to compute all the zeros of a polynomial 
p of degree n with real coefficients. We first assume that p has only simple 
real zeros and proceed as follows: 

1. Either from analytic considerations or by plotting a graph of the
polynomial we obtain a rough estimate of the location of the zeros
$z_n < z_{n-1} < \cdots < z_2 < z_1$.

2. Starting with some $x_0 > z_1$, by Newton iteration we compute the
largest zero $z_1$. The global convergence of Newton’s method in this
case follows from monotonicity arguments (see Problem 6.13).

3. By the Horner scheme we divide $p$ by the linear factor $x - z_1$ and carry
out step two for the reduced polynomial to compute $z_2$. Repeating
this procedure, we successively obtain approximations for all zeros.

4. In order to improve the accuracy, for all zeros Newton’s method is ap- 
plied to the full polynomial p with the starting points of the iteration 
given by the approximations obtained in step three. 
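The four steps above can be sketched as follows for a polynomial with simple real zeros. This is a hedged illustration, not the author's program: the starting point is taken from the Cauchy bound $1 + \max_i |a_i/a_0|$ (an assumption replacing the rough estimate of step one), and the tolerances and function names are arbitrary choices.

```python
def horner(coeffs, z):
    """Return (quotient coefficients of p/(x - z), value p(z))."""
    b = [coeffs[0]]
    for a in coeffs[1:]:
        b.append(z * b[-1] + a)
    return b[:-1], b[-1]

def newton_poly(coeffs, x, tol=1e-14, max_steps=200):
    """Newton's method for a polynomial; p and p' come from two Horner passes."""
    for _ in range(max_steps):
        quotient, px = horner(coeffs, x)
        _, dpx = horner(quotient, x)       # p'(x) = p_1(x)
        if dpx == 0:
            break
        step = px / dpx
        x -= step
        if abs(step) <= tol * max(1.0, abs(x)):
            break
    return x

def real_simple_zeros(coeffs):
    """Steps 2-4: largest zero by Newton, deflate, repeat, then refine on p."""
    n = len(coeffs) - 1
    # Assumed starting point: the Cauchy bound exceeds every zero in modulus.
    start = 1.0 + max(abs(a / coeffs[0]) for a in coeffs[1:])
    zeros, reduced = [], list(coeffs)
    for _ in range(n):
        z = newton_poly(reduced, start)
        zeros.append(z)
        reduced, _ = horner(reduced, z)    # divide by the linear factor x - z
    return [newton_poly(coeffs, z) for z in zeros]

zeros = real_simple_zeros([1.0, -6.0, 11.0, -6.0])   # (x - 1)(x - 2)(x - 3)
print(zeros)
```

The final list comprehension is step four: each deflated approximation is polished by Newton's method on the full polynomial, which limits the accumulation of rounding errors from repeated division.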
Now we consider the case of multiple real zeros. If $z$ is a zero of order $m$,
then we can write

$$p(x) = (x - z)^m q(x), \tag{6.12}$$

where the polynomial $q$ of degree $n - m$ satisfies $q(z) \ne 0$. To see the
effect of (6.12) on Newton’s method we consider it as a fixed-point iteration
$x_{\nu+1} := g(x_\nu)$ with $g$ defined by

$$g(x) := x - \frac{p(x)}{p'(x)}.$$



Using (6.12), by elementary differentiation we obtain

$$g'(z) = 1 - \frac{1}{m}.$$

Therefore, by Theorem 6.2, at a multiple zero Newton’s method is locally 
convergent. Obviously, the convergence at a multiple zero is only linear. 
However, one can modify Newton’s method for multiple zeros such that 
the quadratic convergence is preserved (see Problem 6.14). 

For finding complex zeros, in principle one can apply Newton’s method
in $\mathbb{C}$. For this one has to keep in mind that for polynomials with real
coefficients, the starting values need to be complex, since otherwise Newton’s
method would produce only real approximations. For the conjugate complex
zeros of a polynomial with real coefficients Bairstow’s method avoids
working in the complex plane by using the fact that for two conjugate zeros,
the product of the linear factors $(x - z)(x - \bar z)$ is a polynomial of degree
two with real coefficients. The basic idea is to write the polynomial $p$ of
degree $n$ in the form

$$p(x) = (x^2 - ux - v)\,q(x) + a(x - u) + b,$$

where $q$ is a polynomial of degree $n - 2$, and $a$ and $b$ are constants depending
on $u, v \in \mathbb{R}$. The factor $x^2 - ux - v$ corresponds to two conjugate complex
zeros of $p$ if the pair $u, v$ solves the nonlinear system $a(u, v) = 0$, $b(u, v) = 0$.
The latter can be solved by Newton’s method, and once the solution $u, v$ is
known, the two zeros of $p$ are obtained by solving the quadratic equation
$x^2 - ux - v = 0$.

We conclude this section with some consideration of the question of sta- 
bility. In particular, we show that the zeros of polynomials can be quite 
sensitive to small changes in the coefficients even if all the zeros are simple 
and well separated from each other. 

Let $p$ and $q$ be polynomials of degree $n$ and assume that $z_0$ is a simple
zero of $p$. Consider the perturbed polynomial

$$p(\cdot\,, \varepsilon) := p + \varepsilon q,$$

where $\varepsilon$ is small. Using the theory of functions of a complex variable, it can
be shown that in a neighborhood of $\varepsilon = 0$ the zero $z(\varepsilon)$ depends analytically
on the parameter $\varepsilon$. The derivative $z'$ can be obtained by differentiating
$p[z(\varepsilon), \varepsilon] = 0$ with respect to $\varepsilon$. This yields

$$\{p'[z(\varepsilon)] + \varepsilon q'[z(\varepsilon)]\}\,z'(\varepsilon) + q[z(\varepsilon)] = 0,$$

and setting $\varepsilon = 0$, it follows that

$$z'(0) = -\frac{q(z_0)}{p'(z_0)}.$$

Hence, for small $\varepsilon$ we have that

$$z(\varepsilon) \approx z_0 - \varepsilon\,\frac{q(z_0)}{p'(z_0)}. \tag{6.13}$$

Example 6.24 The polynomial

$$p(x) := (x - 1)(x - 2) \cdots (x - 10) = x^{10} - 55x^9 + \cdots + 10!$$

has the zeros $1, 2, \ldots, 10$, which are well separated from each other. We
perturb the coefficient of $x^9$ by choosing $q(x) := 55x^9$. Since $p'(10) = 9!$,
by (6.13), the zero $z_0 = 10$ of the polynomial $p$ is perturbed into

$$10 - 55\,\frac{10^9}{9!}\,\varepsilon \approx 10 - 1.5 \cdot 10^5\,\varepsilon.$$

This illustrates that finding the zeros of $p$ is an ill-conditioned problem and
that a reliable approximation of the zeros is impossible. □
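The first-order prediction (6.13) can be compared with a directly computed zero of the perturbed polynomial. The following sketch assumes the value $\varepsilon = 10^{-9}$ and evaluates $p$ in its factored form to avoid cancellation in the expanded coefficients; both choices are made for the illustration only.

```python
from math import factorial

def p(x):
    """p(x) = (x - 1)(x - 2)...(x - 10), evaluated in factored form."""
    prod = 1.0
    for k in range(1, 11):
        prod *= (x - k)
    return prod

def dp(x):
    """p'(x) by the product rule."""
    total = 0.0
    for j in range(1, 11):
        term = 1.0
        for k in range(1, 11):
            if k != j:
                term *= (x - k)
        total += term
    return total

def perturbed_zero(eps, x=10.0, steps=50):
    """Newton's method for p + eps*q with q(x) = 55 x^9, started at z0 = 10."""
    for _ in range(steps):
        fx = p(x) + eps * 55.0 * x**9
        dfx = dp(x) + eps * 495.0 * x**8
        x -= fx / dfx
    return x

eps = 1e-9                                               # assumed perturbation size
predicted = 10.0 - eps * 55.0 * 10.0**9 / factorial(9)   # estimate (6.13)
computed = perturbed_zero(eps)
print(predicted, computed)
```

A relative change of $10^{-9}$ in one coefficient moves the zero by about $1.5 \cdot 10^{-4}$, in line with the amplification factor $1.5 \cdot 10^5$ in the example.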



6.4 Least Squares Problems 

Quite often the problem of solving a system of nonlinear equations may 
be replaced by an equivalent problem of minimizing a function and vice 
versa. We illustrate this by introducing the Levenberg-Marquardt method 
as one of the most effective procedures for solving nonlinear least squares 
problems. 

Let $g : \mathbb{R}^n \to \mathbb{R}$ be a twice continuously differentiable function and
consider the problem of minimizing $g$. Let $x_0$ be an approximation for a
local minimum of $g$. In a neighborhood of $x_0$, by Taylor’s formula we may
approximate

$$g(x) \approx g(x_0) + (x - x_0)^T \operatorname{grad} g(x_0) + \frac{1}{2}\,(x - x_0)^T g''(x_0)(x - x_0), \tag{6.14}$$

where

$$g''(x) := \left(\frac{\partial^2 g}{\partial x_j\,\partial x_k}(x)\right)_{j,k=1,\ldots,n}$$

denotes the Hessian matrix of $g$. Minimizing the quadratic function on the
right-hand side of (6.14) yields

$$x_1 = x_0 - [g''(x_0)]^{-1} \operatorname{grad} g(x_0) \tag{6.15}$$

as a new approximation for the minimum of $g$. We observe that (6.15)
coincides with one Newton step for solving the necessary condition
$\operatorname{grad} g(x) = 0$ for a local minimum.

However, if (6.14) is only a very poor approximation to $g$, then we expect
the Newton step (6.15) not to be very effective. In this case it is more
appropriate to use a so-called method of steepest descent; i.e., choose

$$x_1 = x_0 - \lambda M \operatorname{grad} g(x_0) \tag{6.16}$$

as a new approximation. Here $M$ is a positive definite matrix, and the
step size $\lambda > 0$ is chosen such that $g(x_1) < g(x_0)$ is satisfied. This can be
achieved, since by Taylor’s formula we have that

$$g[x_0 - \lambda M \operatorname{grad} g(x_0)] \approx g(x_0) - \lambda\,[\operatorname{grad} g(x_0)]^T M \operatorname{grad} g(x_0)$$

and $M$ is assumed to be positive definite.

After introducing the vector $y \in \mathbb{R}^n$ and the $n \times n$ matrix $A$ by

$$y_j(x) := -\frac{\partial g}{\partial x_j}(x), \qquad a_{jk}(x) := \frac{\partial^2 g}{\partial x_j\,\partial x_k}(x), \tag{6.17}$$

we can rewrite the Newton iteration (6.15) as the linear system

$$A(x_0)(x_1 - x_0) = y, \tag{6.18}$$

which we have to solve for the difference $x_1 - x_0$. Similarly, one step of the
steepest descent (6.16) can be transformed into

$$x_1 - x_0 = \lambda M y. \tag{6.19}$$

Now recall the least squares problem of Example 2.4. In a slight
reformulation, this problem consists in minimizing the function

$$g(x) := \sum_{i=1}^{m} [f_i(x) - u_i]^2$$

over some domain $D$, where the $f_i : D \to \mathbb{R}$ are given functions and the
$u_i \in \mathbb{R}$ are given constants for $i = 1, \ldots, m$. We compute the derivatives

$$\frac{\partial g}{\partial x_j}(x) = 2 \sum_{i=1}^{m} [f_i(x) - u_i]\,\frac{\partial f_i}{\partial x_j}(x)$$
and

$$\frac{\partial^2 g}{\partial x_j\,\partial x_k}(x) = 2 \sum_{i=1}^{m} \left\{ \frac{\partial f_i}{\partial x_j}(x)\,\frac{\partial f_i}{\partial x_k}(x) + [f_i(x) - u_i]\,\frac{\partial^2 f_i}{\partial x_j\,\partial x_k}(x) \right\}.$$

In this case the matrix $(a_{jk})$ contains second derivatives of the functions
$f_i$. However, since these derivatives are multiplied by the factor $[f_i(x) - u_i]$,
which will become small by minimizing $g$, it is justified to neglect this term.
Note that if Newton’s method converges, it always will converge to a zero,
even if we do not use the exact Jacobian for the computation, provided that
the approximate Jacobian at the limit is nonsingular. Hence, we simplify
and replace (6.17) by

$$a_{jk}(x) := 2 \sum_{i=1}^{m} \frac{\partial f_i}{\partial x_j}(x)\,\frac{\partial f_i}{\partial x_k}(x) \tag{6.20}$$

and note that $a_{jj}(x) \ge 0$.

Now the Levenberg-Marquardt method combines (6.18) and (6.19) by
first introducing the $n \times n$ matrix $\tilde A = (\tilde a_{jk})$ with entries

$$\tilde a_{jj} := (1 + \gamma)\,a_{jj}, \qquad \tilde a_{jk} := a_{jk}, \quad j \ne k,$$

where $\gamma$ is some positive parameter, and then replacing (6.18) and (6.19)
by

$$\tilde A(x_0)(x_1 - x_0) = y. \tag{6.21}$$

Obviously, for large $\gamma$ the matrix $\tilde A$ will become diagonally dominant, and
(6.21) will get close to the steepest descent, with

$$M = \operatorname{diag}\left(\frac{1}{a_{11}}, \ldots, \frac{1}{a_{nn}}\right)$$

and $\lambda = 1/\gamma$. For $\gamma \to 0$, on the other hand, (6.21) will turn into the Newton
step (6.18). This ability to gradually vary between Newton’s method and
the steepest descent method is one of the basic features of the
Levenberg-Marquardt method, which we describe as follows:

1. Choose an initial guess $x_0$, some moderately sized value for $\gamma$, and a
factor $\alpha$, say $\gamma = 0.001$ and $\alpha = 10$.

2. Solve the linear system (6.21) to obtain $x_1$.

3. If $g(x_1) \ge g(x_0)$, then reject $x_1$ as a new approximation, replace $\gamma$ by
$\alpha\gamma$, and go back and repeat step two.

4. If $g(x_1) < g(x_0)$, then accept $x_1$ as a new approximation, replace $x_0$
by $x_1$ and $\gamma$ by $\gamma/\alpha$, and go back to step two.

5. Terminate when the difference $|g(x_1) - g(x_0)|$ is smaller than some
given tolerance.
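The five steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a production implementation: the residual and Jacobian callbacks, the small Gaussian elimination routine, the tolerance, and the exponential model $f_i(x) = x_1 e^{x_2 t_i}$ with synthetic data are all hypothetical choices made for the example.

```python
import math

def solve(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= factor * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def levenberg_marquardt(residuals, jacobian, x0, gamma=1e-3, alpha=10.0,
                        tol=1e-14, max_steps=200):
    """residuals(x) -> [f_i(x) - u_i]; jacobian(x) -> rows [df_i/dx_j]."""
    def g(x):
        return sum(r * r for r in residuals(x))
    x, gx, n = list(x0), g(x0), len(x0)
    for _ in range(max_steps):
        r, jac = residuals(x), jacobian(x)
        # y_j = -dg/dx_j and a_jk from (6.20)
        y = [-2.0 * sum(ri * row[j] for ri, row in zip(r, jac)) for j in range(n)]
        a = [[2.0 * sum(row[j] * row[k] for row in jac) for k in range(n)]
             for j in range(n)]
        for j in range(n):
            a[j][j] *= 1.0 + gamma            # modified diagonal of (6.21)
        step = solve(a, y)
        x1 = [xi + si for xi, si in zip(x, step)]
        g1 = g(x1)
        converged = abs(g1 - gx) < tol        # step 5
        if g1 < gx:                           # step 4: accept
            x, gx, gamma = x1, g1, gamma / alpha
        else:                                 # step 3: reject
            gamma *= alpha
        if converged:
            break
    return x

# Hypothetical data generated from u = 2 exp(-0.5 t)
ts = [0.0, 1.0, 2.0, 3.0]
us = [2.0 * math.exp(-0.5 * t) for t in ts]
res = lambda x: [x[0] * math.exp(x[1] * t) - u for t, u in zip(ts, us)]
jac = lambda x: [[math.exp(x[1] * t), x[0] * t * math.exp(x[1] * t)] for t in ts]
fit = levenberg_marquardt(res, jac, [1.0, 0.0])
print(fit)
```

Because the data here are exact, the neglected second-derivative term vanishes at the minimum and the iteration behaves like Newton's method near the solution.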

For a detailed analysis of this method we refer to [44]. For a study of 
nonlinear optimization methods and their relation to nonlinear systems we 
refer to [20]. 







Problems 



6.1 Prove Brouwer’s fixed point theorem in $\mathbb{R}$; i.e., show that if $D \subset \mathbb{R}$ is a
closed and bounded interval and if $f : D \to D$ is continuous, then $f$ has a (not
necessarily unique) fixed point.

6.2 Draw figures illustrating monotone or alternating divergence of the succes- 
sive iterations for a fixed point of a function of one variable. 

6.3 Show how to solve the equation $\tan x = x$ by successive approximations.

6.4 Show that

$$\lim_{\nu \to \infty} \underbrace{\sqrt{2 + \sqrt{2 + \cdots + \sqrt{2}}}}_{\nu \text{ square roots}} = 2.$$

6.5 Let $D \subset \mathbb{R}$ be an open interval and let $f : D \to D$ be $m$ times continuously
differentiable. Under the assumption that the sequence $x_{\nu+1} := f(x_\nu)$ converges
to some $x$ in $D$ with $f'(x) = f''(x) = \cdots = f^{(m-1)}(x) = 0$, show that the
convergence is of order $m$.

6.6 Let the sequence $(x_\nu)$ in $\mathbb{R}$ converge to $x$ such that $x_\nu \ne x$ for all $\nu \in \mathbb{N}$
and

$$x_{\nu+1} - x = (q + \varepsilon_\nu)(x_\nu - x), \quad \nu = 0, 1, \ldots,$$

where $|q| < 1$ and $\varepsilon_\nu \to 0$, $\nu \to \infty$. Show that

$$y_\nu := x_\nu - \frac{(x_{\nu+1} - x_\nu)^2}{x_{\nu+2} - 2x_{\nu+1} + x_\nu}$$

is well-defined for sufficiently large $\nu$ and that

$$\lim_{\nu \to \infty} \frac{y_\nu - x}{x_\nu - x} = 0;$$

i.e., the sequence $(y_\nu)$ converges to $x$ more rapidly than the sequence $(x_\nu)$.
This method for speeding up the convergence of sequences is known as Aitken’s
$\delta^2$ method.



6.7 Let $D \subset \mathbb{R}$ be an open interval, let $f : D \to \mathbb{R}$ be twice continuously
differentiable, and let $x$ be a fixed point of $f$ with $f'(x) \ne 1$. Show that Steffensen’s
method

$$x_{\nu+1} := x_\nu - \frac{[f(x_\nu) - x_\nu]^2}{f[f(x_\nu)] - 2f(x_\nu) + x_\nu}, \quad \nu = 0, 1, \ldots,$$

is locally and quadratically convergent to the fixed point $x$ (see Problem 6.6).

6.8 Discuss Steffensen’s method of Problem 6.7 for the fixed point $x = 0$ of the
function $f(x) := 2x + x^3$.

6.9 Show that

$$x_{\nu+1} := \frac{x_\nu(x_\nu^2 + 3a)}{3x_\nu^2 + a}, \quad \nu = 0, 1, \ldots,$$

is a method of order three for computing the square root of a positive number $a$.

6.10 Prove an analogue of Corollary 6.16 for the secant method (6.9).

6.11 Give conditions for monotone convergence of Newton’s method for a func- 
tion of one variable. 

6.12 Show that Newton’s method for the function $f(x) := x^n - a$, $x > 0$, where
$n > 1$ and $a > 0$, converges globally to $a^{1/n}$.

6.13 Assume that the polynomial $p$ with real coefficients has only real zeros
and denote the largest zero by $z_1$. Show that for any initial point $x_0$ with $x_0 > z_1$
Newton’s method converges to $z_1$.

6.14 Assume that $z$ is a zero of order $m$ of the polynomial $p$. Show that

$$x_{\nu+1} := x_\nu - m\,\frac{p(x_\nu)}{p'(x_\nu)}, \quad \nu = 0, 1, \ldots,$$

converges locally and quadratically to the zero $z$.

6.15 Show that for a nonsingular $n \times n$ matrix $A$ the sequence

$$A_{\nu+1} := A_\nu[2I - AA_\nu], \quad \nu = 0, 1, \ldots,$$

converges quadratically to the inverse $A^{-1}$, provided that $\|I - AA_0\| < 1$.

6.16 Write a computer program for finding $n$ simple zeros of a polynomial of
degree $n$ with real coefficients. Use this code for the computation of the zeros of
the Laguerre polynomial $L_4(x) = x^4 - 16x^3 + 72x^2 - 96x + 24$.

6.17 Show that for the function $f : (0, \infty) \to \mathbb{R}$, given by




the Newton iterations starting with $x_0 = 1$ converge and that the limit, however,
is not a zero of $f$.

6.18 The eigenvalue problem $Ax = \lambda x$ for an $n \times n$ matrix $A$ is equivalent to
the equation $f(z) = 0$, where $f : \mathbb{R}^n \times \mathbb{R} \to \mathbb{R}^n \times \mathbb{R}$ is defined by

$$f\begin{pmatrix} x \\ \lambda \end{pmatrix} := \begin{pmatrix} Ax - \lambda x \\ x^T x - 1 \end{pmatrix}.$$

Write down Newton’s method for this equation.

6.19 Write a computer program for solving a least squares problem by the
Levenberg-Marquardt method.

6.20 The set of all points $c \in \mathbb{C}$ for which the fixed point iteration $z_{\nu+1} := z_\nu^2 + c$
starting with $z_0 = 0$ remains bounded is called the Mandelbrot set. Write a
computer program for visualizing the Mandelbrot set.




Matrix Eigenvalue Problems 



Many problems in science and engineering lead to eigenvalue problems for 
matrices. These occur either directly or by discretization of eigenvalue prob- 
lems for differential or integral operators. In the latter case the size of the 
matrices will be rather large. It is the purpose of this chapter to intro- 
duce some of the main ideas in matrix eigenvalue computations without 
attempting to be comprehensive. For a more detailed study we refer to 
[27, 65]. 

For the numerical computation of matrix eigenvalues we have to distin- 
guish between two groups of methods: 

1. In the so-called direct methods the eigenvalues are obtained as zeros 
of the characteristic polynomial. 

2. In contrast, iterative methods approximate the eigenvalues through a 
successive approximation procedure without using the characteristic 
polynomial. 

Since, as illustrated in Example 6.24, the computation of zeros of poly- 
nomials of high degree tends in general to be ill-conditioned, in practice it- 
erative methods are used almost exclusively. In this chapter we will discuss 
the two most important methods of this class, namely the Jacobi method 
and the QR algorithm . In the last section we will also briefly describe the 
Hessenberg method as an example of a direct method. 

A key factor in all eigenvalue computations is the fact that similarity
transformations leave the eigenvalues of a matrix invariant; i.e., for a given
matrix $A$ the matrices $A$ and $C^{-1}AC$ have the same eigenvalues for all
nonsingular matrices $C$. This can be seen either from the equivalence of
the equations

$$Ax = \lambda x \quad\text{and}\quad (C^{-1}AC)\,C^{-1}x = \lambda\,C^{-1}x$$

or from the multiplication theorem for determinants

$$\det(\lambda I - A) = \det[C^{-1}(\lambda I - A)C] = \det(\lambda I - C^{-1}AC);$$

i.e., similar matrices have the same characteristic polynomial. This
invariance allows one to transform a given matrix $A$ by a similarity
transformation into a matrix of simpler form with the same eigenvalues as $A$. In
particular, the iterative methods successively construct sequences of similar
matrices that converge to a diagonal matrix or an upper (or lower)
triangular matrix from which the eigenvalues can be read off as the diagonal
elements.



7.1 Examples 

We begin by illustrating how the discretization of eigenvalue problems for 
differential operators leads to eigenvalue problems for large matrices. 

Example 7.1 The vibrations of a string are modeled by the so-called wave
equation

$$\frac{\partial^2 w}{\partial x^2} = \frac{1}{c^2}\,\frac{\partial^2 w}{\partial t^2},$$

where $w = w(x, t)$ denotes the vertical elongation and $c$ is the speed of
sound in the string. Assuming that the string is clamped at $x = 0$ and
$x = 1$, the boundary conditions $w(0, t) = w(1, t) = 0$ must be satisfied for
all times $t$. Obviously, the time-harmonic wave

$$w(x, t) = v(x)\,e^{i\omega t}$$

with frequency $\omega$ solves the wave equation, provided that the space-dependent
part $v$ satisfies

$$-v'' = \lambda v \quad\text{on } [0, 1],$$

where $\lambda := \omega^2/c^2$. The boundary conditions $w(0, t) = w(1, t) = 0$ are
satisfied if $v$ satisfies the boundary conditions

$$v(0) = v(1) = 0.$$

Hence, introducing the linear space

$$U := \{v \in C[0, 1] : v \text{ is twice continuously differentiable},\ v(0) = v(1) = 0\}$$

and defining the differential operator $D : U \to C[0, 1]$ by $D : v \mapsto -v''$,
we are led to the eigenvalue problem $Dv = \lambda v$. Elementary calculations
show that the functions $v_m(x) = \sin m\pi x$ are eigenfunctions of $D$ with the
eigenvalues $\lambda_m = m^2\pi^2$ for $m = 1, 2, \ldots$. It can be shown that these are
the only eigenvalues and eigenfunctions of $D$.

For discussing an approximate solution we consider the slightly more
general differential equation

$$-v'' + pv = \lambda v \quad\text{on } [0, 1]$$

with boundary conditions $v(0) = v(1) = 0$, where $p \in C[0, 1]$ is a given
positive function. We can proceed as in Example 2.1 and choose an equidistant
mesh $x_j = jh$, $j = 0, \ldots, n + 1$, with step size $h = 1/(n + 1)$ and $n \in \mathbb{N}$. At
the internal grid points $x_j$, $j = 1, \ldots, n$, we replace the differential quotient
by the difference quotient

$$v''(x_j) \approx \frac{1}{h^2}\,[v(x_{j+1}) - 2v(x_j) + v(x_{j-1})]$$

to obtain the system of equations

$$\frac{1}{h^2}\,\{-v_{j-1} + 2v_j - v_{j+1}\} + p_j v_j = \lambda v_j, \quad j = 1, \ldots, n,$$

for approximate values $v_j$ to the exact solution $v(x_j)$. Here, we have set
$p_j := p(x_j)$ for $j = 0, \ldots, n + 1$. This system has to be complemented by
the two boundary conditions $v_0 = v_{n+1} = 0$. For an abbreviated notation
we introduce the $n \times n$ tridiagonal matrix

$$A = \frac{1}{h^2}\begin{pmatrix}
2 + h^2 p_1 & -1 & & & \\
-1 & 2 + h^2 p_2 & -1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & -1 & 2 + h^2 p_{n-1} & -1 \\
 & & & -1 & 2 + h^2 p_n
\end{pmatrix}$$

and the vector $v = (v_1, \ldots, v_n)^T$. Then the above system of equations,
including the boundary conditions, reads

$$Av = \lambda v;$$

i.e., the eigenvalue problem for the differential operator $D$ is approximated
by the eigenvalue problem for the matrix $A$. □
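A quick numerical check of this discretization is possible in the special case $p = 0$ (an assumption made here so that the result can be compared with the string eigenvalue $\lambda_1 = \pi^2$). It is a standard fact that the sampled sine functions are then exact eigenvectors of the tridiagonal matrix, with eigenvalues $\frac{4}{h^2}\sin^2\frac{m\pi h}{2}$. The sketch below, with illustrative names and $n = 99$, verifies this for $m = 1$.

```python
import math

def apply_A(v, h, p):
    """Multiply the tridiagonal matrix A of the example by the vector v,
    using the boundary values v_0 = v_{n+1} = 0."""
    n = len(v)
    out = []
    for j in range(n):
        left = v[j - 1] if j > 0 else 0.0
        right = v[j + 1] if j < n - 1 else 0.0
        out.append((-left + 2.0 * v[j] - right) / h**2 + p[j] * v[j])
    return out

n = 99
h = 1.0 / (n + 1)
p = [0.0] * n                  # assumption: p = 0, i.e., the plain string
m = 1
v = [math.sin(m * math.pi * (j + 1) * h) for j in range(n)]

Av = apply_A(v, h, p)
mu = sum(a * b for a, b in zip(Av, v)) / sum(b * b for b in v)
residual = max(abs(a - mu * b) for a, b in zip(Av, v))

print(mu, math.pi**2, residual)
```

For this mesh the matrix eigenvalue differs from $\pi^2$ by less than $10^{-3}$; this observation concerns only the special case computed here and is not a convergence proof.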

The important question as to how well the matrix eigenvalues
approximate the eigenvalues of the differential operator and whether we have
convergence of the eigenvalues as $h \to 0$ is beyond the scope of this book
(see Problem 7.2). The example is meant only as an illustration of the fact
that eigenvalue problems for large matrices arise through the discretiza- 
tion of eigenvalue problems for ordinary differential operators and also for 
partial differential operators. In the same spirit, eigenvalue problems for 
integral operators can be approximated by matrix eigenvalue problems, as 
indicated in the following example. 
Example 7.2 Consider the eigenvalue problem

$$\int_0^1 K(x, y)\,\varphi(y)\,dy = \lambda\varphi(x), \quad x \in [0, 1],$$

for a linear integral operator with continuous kernel $K$. For the numerical
approximation we proceed as in Example 2.3 and approximate the integral
by the rectangular rule with equidistant quadrature points $x_k = k/n$ for
$k = 1, \ldots, n$. If we require the approximated equation to be satisfied only
at the grid points, we arrive at the approximating system of equations

$$\frac{1}{n} \sum_{k=1}^{n} K(x_j, x_k)\,\varphi_k = \lambda\varphi_j, \quad j = 1, \ldots, n,$$

for approximate values $\varphi_j$ to the exact solution $\varphi(x_j)$. Hence, we
approximate the eigenvalues of the integral operator by the eigenvalues of the
matrix with entries $K(x_j, x_k)/n$. Of course, instead of the rectangular rule
any other quadrature rule can be used. A discussion of the convergence of
the matrix eigenvalues to the eigenvalues of the integral operator is again
beyond the aim of this introduction. □



7.2 Estimates for the Eigenvalues 

At this point we urge the reader to recall the basic facts about eigenvalues
of matrices, in particular those that were presented in Section 3.4. In the
sequel, by $(\cdot\,, \cdot)$ we denote the Euclidean scalar product in $\mathbb{C}^n$ and by $\|\cdot\|_2$
the corresponding Euclidean norm.

The eigenvalues of Hermitian matrices can be characterized by the
following maximum principles. These can be used to get some rough estimates
for the eigenvalues. Note that for the eigenvalues of Hermitian matrices the
geometric and the algebraic multiplicity coincide (see Problem 7.4).

Theorem 7.3 (Rayleigh) Let $A$ be a Hermitian $n \times n$ matrix with
eigenvalues

$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$$

(where multiple eigenvalues occur according to their multiplicity) and
corresponding orthonormal eigenvectors $x_1, x_2, \ldots, x_n$. Then

$$\lambda_j = \max_{\substack{x \in V_j \\ x \ne 0}} \frac{(Ax, x)}{(x, x)}, \quad j = 1, \ldots, n,$$

where the subspaces $V_1, \ldots, V_n$ are defined by $V_1 := \mathbb{C}^n$ and

$$V_j := \{x \in \mathbb{C}^n : (x, x_k) = 0,\ k = 1, \ldots, j - 1\}, \quad j = 2, \ldots, n.$$
Proof. Let $x \in V_j$ with $x \ne 0$. Then

$$x = \sum_{k=j}^{n} (x, x_k)\,x_k \quad\text{and}\quad \sum_{k=j}^{n} |(x, x_k)|^2 = (x, x).$$

Hence

$$Ax = \sum_{k=j}^{n} \lambda_k (x, x_k)\,x_k$$

and

$$(Ax, x) = \sum_{k=j}^{n} \lambda_k |(x, x_k)|^2 \le \lambda_j \sum_{k=j}^{n} |(x, x_k)|^2 = \lambda_j (x, x).$$

This implies

$$\sup_{\substack{x \in V_j \\ x \ne 0}} \frac{(Ax, x)}{(x, x)} \le \lambda_j,$$

and the statement follows from $(Ax_j, x_j) = \lambda_j$ and $x_j \in V_j$. □



This maximum principle can be used in a simple manner to obtain lower
bounds for the largest eigenvalue of Hermitian matrices. For the matrix

$$A = \begin{pmatrix} 1 & 3 & 2 \\ 3 & 5 & 1 \\ 2 & 1 & 4 \end{pmatrix}$$

by using $x = (1, 1, 1)^T$ we find the estimate $\lambda_1 \ge 7.33$ as compared to the
exact eigenvalue $\lambda_1 = 7.58\ldots$. Using $x = (1, 2, 1)^T$ leads to the estimate
$\lambda_1 \ge 7.50$.
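The two lower bounds are simply Rayleigh quotients of the chosen trial vectors and can be reproduced in a few lines. The helper name `rayleigh_quotient` is an illustrative choice; the sketch is restricted to real matrices and vectors.

```python
def rayleigh_quotient(a, x):
    """(Ax, x) / (x, x) for a real matrix a (list of rows) and real vector x."""
    ax = [sum(aij * xj for aij, xj in zip(row, x)) for row in a]
    return sum(p * q for p, q in zip(ax, x)) / sum(q * q for q in x)

A = [[1, 3, 2],
     [3, 5, 1],
     [2, 1, 4]]

bound1 = rayleigh_quotient(A, [1, 1, 1])   # 22/3
bound2 = rayleigh_quotient(A, [1, 2, 1])   # 45/6
print(bound1, bound2)
```

By Theorem 7.3 with $j = 1$, any nonzero trial vector gives a lower bound for $\lambda_1$, so trying several vectors and keeping the largest quotient is a cheap way to tighten the estimate.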

Using Rayleigh’s principle to obtain bounds for the smaller eigenvalues 
requires the knowledge of the eigenvectors for the preceding larger eigen- 
values. This problem is circumvented in the following minimum maximum 
principle. 



Theorem 7.4 (Courant) Let $A$ be a Hermitian $n \times n$ matrix with
eigenvalues

$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$$

(where multiple eigenvalues occur according to their multiplicity). Then

$$\lambda_j = \min_{U_j \in M_j}\ \max_{\substack{x \in U_j \\ x \ne 0}} \frac{(Ax, x)}{(x, x)}, \quad j = 1, \ldots, n,$$

where $M_j$ denotes the set of all subspaces $U_j \subset \mathbb{C}^n$ of dimension $n + 1 - j$.
Proof. First we note that because of

$$\sup_{\substack{x \in U_j \\ x \ne 0}} \frac{(Ax, x)}{(x, x)} = \sup_{\substack{x \in U_j \\ (x, x) = 1}} (Ax, x)$$

and the continuity of the function $x \mapsto (Ax, x)$, the supremum is attained;
i.e., the maximum exists.

By $x_1, x_2, \ldots, x_n$ we denote orthonormal eigenvectors corresponding to
the eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$. First, we show that for a given
subspace $U_j$ of dimension $n + 1 - j$ there exists a vector $x \in U_j$, $x \ne 0$, such that

$$(x, x_k) = 0, \quad k = j + 1, \ldots, n. \tag{7.1}$$

Let $z_1, \ldots, z_{n+1-j}$ be a basis of $U_j$. Then we can represent each $x \in U_j$ by

$$x = \sum_{i=1}^{n+1-j} \alpha_i z_i. \tag{7.2}$$

In order to guarantee (7.1), the $n + 1 - j$ coefficients $\alpha_1, \ldots, \alpha_{n+1-j}$ must
satisfy the $n - j$ linear equations

$$\sum_{i=1}^{n+1-j} \alpha_i (z_i, x_k) = 0, \quad k = j + 1, \ldots, n.$$

This underdetermined system always has a nontrivial solution. For the
corresponding $x$ given by (7.2) we have $x \ne 0$, and from

$$x = \sum_{k=1}^{j} (x, x_k)\,x_k$$

we obtain that

$$(Ax, x) = \sum_{k=1}^{j} \lambda_k |(x, x_k)|^2 \ge \lambda_j \sum_{k=1}^{j} |(x, x_k)|^2 = \lambda_j (x, x),$$

whence

$$\max_{\substack{x \in U_j \\ x \ne 0}} \frac{(Ax, x)}{(x, x)} \ge \lambda_j$$

follows.

On the other hand, for the subspace

$$U_j = \{x \in \mathbb{C}^n : (x, x_k) = 0,\ k = 1, \ldots, j - 1\}$$

of dimension $n + 1 - j$, by Theorem 7.3 we have the equality

$$\max_{\substack{x \in U_j \\ x \ne 0}} \frac{(Ax, x)}{(x, x)} = \lambda_j,$$

and the proof is finished. □
Corollary 7.5 Let $A$ and $B$ be two Hermitian $n \times n$ matrices with
eigenvalues $\lambda_1(A) \ge \lambda_2(A) \ge \cdots \ge \lambda_n(A)$ and $\lambda_1(B) \ge \lambda_2(B) \ge \cdots \ge \lambda_n(B)$.
Then

$$|\lambda_j(A) - \lambda_j(B)| \le \|A - B\|, \quad j = 1, \ldots, n,$$

for any norm $\|\cdot\|$ on $\mathbb{C}^n$.

Proof. From the Cauchy-Schwarz inequality we have that

$$(Ax - Bx, x) \le \|(A - B)x\|_2\,\|x\|_2 \le \|A - B\|_2\,\|x\|_2^2$$

and hence

$$(Ax, x) \le (Bx, x) + \|A - B\|_2\,\|x\|_2^2.$$

By the Courant minimum maximum principle of Theorem 7.4 this implies

$$\lambda_j(A) \le \lambda_j(B) + \|A - B\|_2, \quad j = 1, \ldots, n.$$

Interchanging the roles of $A$ and $B$, we also have that

$$\lambda_j(B) \le \lambda_j(A) + \|B - A\|_2, \quad j = 1, \ldots, n,$$

and therefore

$$|\lambda_j(A) - \lambda_j(B)| \le \|A - B\|_2, \quad j = 1, \ldots, n.$$

Now the statement follows from

$$\|A - B\|_2 = \rho(A - B) \le \|A - B\|,$$

which is a consequence of Theorems 3.31 and 3.32. □

Corollary 7.6 For the eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$ of a Hermitian
$n \times n$ matrix $A = (a_{jk})$ we have that

$$|\lambda_j - a'_{jj}|^2 \le \sum_{\substack{j,k=1 \\ j \ne k}}^{n} |a_{jk}|^2, \quad j = 1, \ldots, n,$$

where the elements $a'_{11}, \ldots, a'_{nn}$ represent a permutation of the diagonal
elements $a_{11}, \ldots, a_{nn}$ of $A$ such that $a'_{11} \ge a'_{22} \ge \cdots \ge a'_{nn}$.

Proof. Use $B = \operatorname{diag}(a_{jj})$ and $\|\cdot\| = \|\cdot\|_2$ in the preceding corollary. □

We conclude this section with an extension of the above results to general 
matrices that gives a rough estimate as to where in C the eigenvalues are 
located. 
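Theorem 7.7 (Gerschgorin) Let $A = (a_{jk})$ be a complex $n \times n$ matrix
and define the disks

$$G_j := \Big\{\lambda \in \mathbb{C} : |\lambda - a_{jj}| \le \sum_{\substack{k=1 \\ k \ne j}}^{n} |a_{jk}|\Big\}, \quad j = 1, \ldots, n,$$

and

$$G_j^* := \Big\{\lambda \in \mathbb{C} : |\lambda - a_{jj}| \le \sum_{\substack{k=1 \\ k \ne j}}^{n} |a_{kj}|\Big\}, \quad j = 1, \ldots, n.$$

Then the eigenvalues $\lambda$ of $A$ satisfy

$$\lambda \in \bigcup_{j=1}^{n} G_j \cap \bigcup_{j=1}^{n} G_j^*.$$

Proof. Assume that $Ax = \lambda x$ and $\|x\|_\infty = 1$, and for $x = (x_1, \ldots, x_n)^T$
choose $j$ such that $|x_j| = \|x\|_\infty = 1$. Then

$$|\lambda - a_{jj}| = |(\lambda - a_{jj})\,x_j| = \Big|\sum_{\substack{k=1 \\ k \ne j}}^{n} a_{jk} x_k\Big| \le \sum_{\substack{k=1 \\ k \ne j}}^{n} |a_{jk}|,$$

and therefore

$$\lambda \in \bigcup_{j=1}^{n} G_j.$$

Since the eigenvalues of $A^*$ are the complex conjugates of the eigenvalues of
$A$ (see Problem 7.3) we also have that

$$\lambda \in \bigcup_{j=1}^{n} G_j^*,$$

and the theorem is proven. □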
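The disks of Theorem 7.7 are cheap to compute. The sketch below returns the row disks $G_j$ and the column disks $G_j^*$ as (center, radius) pairs; the function names and the $2 \times 2$ test matrix, whose eigenvalues $4$ and $-1$ are known in closed form, are choices made for the illustration.

```python
def gerschgorin_disks(a):
    """Return (row_disks, column_disks) as lists of (center, radius) pairs."""
    n = len(a)
    rows = [(a[j][j], sum(abs(a[j][k]) for k in range(n) if k != j))
            for j in range(n)]
    cols = [(a[j][j], sum(abs(a[k][j]) for k in range(n) if k != j))
            for j in range(n)]
    return rows, cols

def in_union(lam, disks):
    """True if lam lies in at least one of the closed disks."""
    return any(abs(lam - center) <= radius for center, radius in disks)

A = [[1, 2],
     [3, 2]]            # characteristic polynomial: (lambda - 4)(lambda + 1)
rows, cols = gerschgorin_disks(A)
print(rows, cols)
for lam in (4, -1):
    print(lam, in_union(lam, rows), in_union(lam, cols))
```

Both eigenvalues lie in the union of the row disks and in the union of the column disks, as the theorem asserts; note that an individual disk need not contain an eigenvalue.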



7.3 The Jacobi Method 

The method described in this section was discovered by Jacobi in 1846 and 
can be used to iteratively compute all the eigenvalues and eigenvectors of 
real symmetric matrices. 
Lemma 7.8 The Frobenius norm

$$\|A\|_F := \Big(\sum_{j,k=1}^{n} |a_{jk}|^2\Big)^{1/2}$$

of an $n \times n$ matrix $A = (a_{jk})$ is invariant with respect to unitary
transformations.

Proof. The trace

$$\operatorname{tr} A := \sum_{j=1}^{n} a_{jj}$$

of a matrix $A$ is commutative; i.e., $\operatorname{tr} AB = \operatorname{tr} BA$. This follows from

$$\sum_{j=1}^{n} (AB)_{jj} = \sum_{j=1}^{n} \sum_{k=1}^{n} a_{jk} b_{kj} = \sum_{k=1}^{n} \sum_{j=1}^{n} b_{kj} a_{jk} = \sum_{k=1}^{n} (BA)_{kk}.$$

In particular, we have that

$$\operatorname{tr} AA^* = \sum_{j=1}^{n} \sum_{k=1}^{n} a_{jk}\,\overline{a_{jk}} = \sum_{j=1}^{n} \sum_{k=1}^{n} |a_{jk}|^2 = \|A\|_F^2.$$

Therefore, for each unitary matrix $Q$ it follows that

$$\|Q^*AQ\|_F^2 = \operatorname{tr}(Q^*AQQ^*A^*Q) = \operatorname{tr}(Q^*AA^*Q) = \operatorname{tr}(AA^*QQ^*) = \|A\|_F^2,$$

and the lemma is proven. □

Corollary 7.9 The eigenvalues of an $n \times n$ matrix $A$ (counted repeatedly
according to their algebraic multiplicity) satisfy Schur’s inequality

$$\sum_{j=1}^{n} |\lambda_j|^2 \le \|A\|_F^2.$$

Equality holds if and only if the matrix $A$ is normal, i.e., if $AA^* = A^*A$.

Proof. By Theorem 3.27 there exists a unitary matrix $Q$ such that
$R := Q^*AQ$ is an upper triangular matrix. Hence

$$\|A\|_F^2 = \|R\|_F^2 = \sum_{j=1}^{n} |\lambda_j|^2 + \sum_{j=1}^{n-1} \sum_{k=j+1}^{n} |r_{jk}|^2, \tag{7.3}$$

since the diagonal elements of $R = (r_{jk})$ coincide with the eigenvalues of
the similar matrices $R$ and $A$. Now Schur’s inequality follows immediately
from (7.3).
For the discussion of the case of equality, we first note that any unitary transformation of a normal matrix is again normal. This is a consequence of the identity

\[ Q^*AQ(Q^*AQ)^* - (Q^*AQ)^*Q^*AQ = Q^*(AA^* - A^*A)Q. \]

If equality holds in Schur's inequality, then (7.3) implies that $R$ is a diagonal matrix. Hence $R$, and therefore $A$, is normal.

Conversely, if $A$ is normal, then the upper triangular matrix $R$ must also be normal. Now, from

\[ (RR^*)_{jj} = \sum_{k=1}^{n} r_{jk} r^*_{kj} = \sum_{k=j}^{n} |r_{jk}|^2 \]

and

\[ (R^*R)_{jj} = \sum_{k=1}^{n} r^*_{jk} r_{kj} = \sum_{k=1}^{j} |r_{kj}|^2 \]

we conclude that

\[ \sum_{k=j}^{n} |r_{jk}|^2 = \sum_{k=1}^{j} |r_{kj}|^2, \quad j = 1, \dots, n. \]

This implies $r_{jk} = 0$ for $j < k$, i.e., $R$ is a diagonal matrix, and from (7.3) we deduce that equality holds in Schur's inequality if $A$ is normal. □
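Schur's inequality and the invariance of the Frobenius norm can be observed numerically. The sketch below assumes NumPy; the function name `schur_gap` and the three small matrices are illustrative choices. It evaluates the gap $\|A\|_F^2 - \sum_j |\lambda_j|^2$ for a symmetric (hence normal) matrix and for a triangular matrix that is not normal, and it checks that an orthogonal similarity transformation leaves the Frobenius norm unchanged.

```python
import numpy as np

def schur_gap(A):
    """Return ||A||_F^2 - sum_j |lambda_j|^2, nonnegative by Schur's inequality."""
    A = np.asarray(A, dtype=complex)
    lam = np.linalg.eigvals(A)
    return float(np.linalg.norm(A, 'fro') ** 2 - np.sum(np.abs(lam) ** 2))

# Symmetric, hence normal: the gap vanishes up to rounding.
normal = np.array([[2.0, -1.0, 0.0],
                   [-1.0, 2.0, -1.0],
                   [0.0, -1.0, 2.0]])
# Upper triangular and not normal: the gap equals |r_12|^2 = 25, cf. (7.3).
non_normal = np.array([[1.0, 5.0],
                       [0.0, 2.0]])

gap_normal = schur_gap(normal)
gap_non_normal = schur_gap(non_normal)

# Invariance of the Frobenius norm (Lemma 7.8) with an orthogonal Q from a QR factorization.
Q, _ = np.linalg.qr(np.array([[1.0, 2.0, 0.0],
                              [0.0, 1.0, 3.0],
                              [4.0, 0.0, 1.0]]))
norm_change = abs(np.linalg.norm(Q.T @ normal @ Q, 'fro') - np.linalg.norm(normal, 'fro'))
print(gap_normal, gap_non_normal, norm_change)
```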

For any $n \times n$ matrix $A = (a_{jk})$ we introduce the quantity

\[ N(A) := \Bigl( \sum_{\substack{j,k=1 \\ j \ne k}}^{n} |a_{jk}|^2 \Bigr)^{1/2} \tag{7.4} \]

as a measure for the deviation of $A$ from a diagonal matrix.

Lemma 7.10 Normal matrices $A$ satisfy

\[ \sum_{j=1}^{n} |\lambda_j|^2 = \sum_{j=1}^{n} |a_{jj}|^2 + [N(A)]^2. \]

Proof. This follows from Corollary 7.9. □

The main idea of the Jacobi method for real symmetric matrices is to successively reduce $N(A)$ by elementary plane rotation matrices such that in the limit the matrix becomes diagonal (with the eigenvalues as diagonal entries).
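Lemma 7.11 For each pair $j < k$ and each $\varphi \in \mathbb{R}$ the matrix

\[
U = \begin{pmatrix}
1 & & & & & & \\
 & \ddots & & & & & \\
 & & \cos\varphi & \cdots & -\sin\varphi & & \\
 & & \vdots & \ddots & \vdots & & \\
 & & \sin\varphi & \cdots & \cos\varphi & & \\
 & & & & & \ddots & \\
 & & & & & & 1
\end{pmatrix},
\]

which coincides with the identity matrix except for $u_{jj} = u_{kk} = \cos\varphi$ and $u_{kj} = -u_{jk} = \sin\varphi$ (and which describes a rotation in the $x_j x_k$-plane) is unitary.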



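Proof. This follows from

\[
\begin{pmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{pmatrix}
\begin{pmatrix} \cos\varphi & \sin\varphi \\ -\sin\varphi & \cos\varphi \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}
\]

and

\[
\begin{pmatrix} \cos\varphi & \sin\varphi \\ -\sin\varphi & \cos\varphi \end{pmatrix}
\begin{pmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.
\]

□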



Lemma 7.12 Let $A$ be a real symmetric matrix and let $U$ be the unitary matrix of Lemma 7.11. Then $B = U^*AU$ is also real and symmetric and has the entries

\[
\begin{aligned}
b_{jj} &= a_{jj}\cos^2\varphi + a_{jk}\sin 2\varphi + a_{kk}\sin^2\varphi, \\
b_{kk} &= a_{jj}\sin^2\varphi - a_{jk}\sin 2\varphi + a_{kk}\cos^2\varphi, \\
b_{jk} = b_{kj} &= a_{jk}\cos 2\varphi + \tfrac{1}{2}(a_{kk} - a_{jj})\sin 2\varphi, \\
b_{ij} = b_{ji} &= a_{ij}\cos\varphi + a_{ik}\sin\varphi, \quad i \ne j, k, \\
b_{ik} = b_{ki} &= -a_{ij}\sin\varphi + a_{ik}\cos\varphi, \quad i \ne j, k, \\
b_{il} &= a_{il}, \quad i, l \ne j, k;
\end{aligned}
\]

i.e., the matrix $B$ differs from $A$ only in the $j$th and $k$th rows and columns.

Proof. The matrix $B$ is real, since $A$ and $U$ are real, and it is symmetric, since the unitary transformation of a Hermitian matrix is again Hermitian. Elementary calculations show that

\[
\begin{pmatrix} \cos\varphi & \sin\varphi \\ -\sin\varphi & \cos\varphi \end{pmatrix}
\begin{pmatrix} a_{jj} & a_{jk} \\ a_{kj} & a_{kk} \end{pmatrix}
\begin{pmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{pmatrix}
= \begin{pmatrix} b_{jj} & b_{jk} \\ b_{kj} & b_{kk} \end{pmatrix}
\]

with $b_{jj}$, $b_{jk}$, $b_{kj}$, and $b_{kk}$ as stated in the lemma. For $i \ne j, k$ we have that

\[ b_{ij} = \sum_{r,s=1}^{n} u_{si} a_{sr} u_{rj} = a_{ij} u_{jj} + a_{ik} u_{kj} = a_{ij}\cos\varphi + a_{ik}\sin\varphi \]

and

\[ b_{ik} = \sum_{r,s=1}^{n} u_{si} a_{sr} u_{rk} = a_{ij} u_{jk} + a_{ik} u_{kk} = -a_{ij}\sin\varphi + a_{ik}\cos\varphi. \]

Finally, we have

\[ b_{il} = \sum_{r,s=1}^{n} u_{si} a_{sr} u_{rl} = a_{il} \]

for $i, l \ne j, k$. □

Lemma 7.13 For

\[ \tan 2\varphi = \frac{2a_{jk}}{a_{jj} - a_{kk}}, \quad a_{jj} \ne a_{kk}, \]

\[ \varphi = \frac{\pi}{4}, \quad a_{jj} = a_{kk}, \]

the transformation of Lemma 7.12 annihilates the elements

\[ b_{jk} = b_{kj} = 0 \]

and reduces the off-diagonal elements according to

\[ [N(B)]^2 = [N(A)]^2 - 2a_{jk}^2. \]

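Proof. $b_{jk} = b_{kj} = 0$ follows immediately from Lemma 7.12. Applying Lemma 7.8 to the matrices

\[ \begin{pmatrix} a_{jj} & a_{jk} \\ a_{kj} & a_{kk} \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} b_{jj} & b_{jk} \\ b_{kj} & b_{kk} \end{pmatrix} \]

yields

\[ a_{jj}^2 + 2a_{jk}^2 + a_{kk}^2 = b_{jj}^2 + b_{kk}^2. \]

From this, with the aid of Lemmas 7.8 and 7.12 we find that

\[
\begin{aligned}
[N(B)]^2 &= \|B\|_F^2 - \sum_{i=1}^{n} b_{ii}^2 = \|A\|_F^2 - \sum_{i=1}^{n} b_{ii}^2 \\
&= [N(A)]^2 + \sum_{i=1}^{n} (a_{ii}^2 - b_{ii}^2) = [N(A)]^2 - 2a_{jk}^2,
\end{aligned}
\]

which completes the proof. □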
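A single rotation step can be checked numerically. The following sketch assumes NumPy and uses 0-based indices; the names `off_norm_sq` and `jacobi_rotation` and the $3 \times 3$ test matrix are illustrative choices. It builds the rotation of Lemma 7.11 with the angle of Lemma 7.13, forms $B = U^*AU$, and confirms that $b_{jk}$ vanishes and that $[N(B)]^2 = [N(A)]^2 - 2a_{jk}^2$.

```python
import numpy as np

def off_norm_sq(A):
    """[N(A)]^2: sum of the squares of the off-diagonal entries of a real matrix."""
    return float(np.sum(A ** 2) - np.sum(np.diag(A) ** 2))

def jacobi_rotation(A, j, k):
    """Rotation matrix of Lemma 7.11 with the angle prescribed in Lemma 7.13."""
    if A[j, j] != A[k, k]:
        phi = 0.5 * np.arctan(2.0 * A[j, k] / (A[j, j] - A[k, k]))
    else:
        phi = np.pi / 4
    U = np.eye(A.shape[0])
    U[j, j] = U[k, k] = np.cos(phi)
    U[k, j] = np.sin(phi)
    U[j, k] = -np.sin(phi)
    return U

# Illustrative symmetric test matrix.
A = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])
U = jacobi_rotation(A, 0, 1)      # annihilate the entry in row 1, column 2 (1-based)
B = U.T @ A @ U
print(np.round(B, 4))
print(off_norm_sq(A), off_norm_sq(B))
```

Forming the full product `U.T @ A @ U` is chosen for readability; since only two rows and two columns change, an efficient code would update just those.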
Note that the quantities required for the computation of the elements of the transformed matrix can be obtained by the trigonometric identities

\[ \cos 2\varphi = \frac{1}{\sqrt{1 + \tan^2 2\varphi}}, \]

\[ \cos\varphi = \sqrt{\tfrac{1}{2}(1 + \cos 2\varphi)}, \quad \sin\varphi = \sqrt{\tfrac{1}{2}(1 - \cos 2\varphi)}. \]

The sign of the root in the expression for $\sin\varphi$ has to be chosen such that it coincides with the sign of $\tan 2\varphi$.

The classical Jacobi method generates a sequence $(A_\nu)$ of similar matrices by starting with the given matrix $A_0 := A$ and choosing the unitary transformation at the $\nu$th step according to Lemma 7.13 such that the nondiagonal element of $A_{\nu-1}$ with largest absolute value is annihilated. It is obvious that the elements annihilated in one step of the Jacobi iteration, in general, do not remain zero during subsequent steps. However, we can establish the following convergence result.

Theorem 7.14 The classical Jacobi method converges; i.e., the sequence $(A_\nu)$ converges to a diagonal matrix with the eigenvalues of $A$ as diagonal elements.

Proof. For one step of the Jacobi method, from

\[ [N(A)]^2 \le (n^2 - n) \max_{\substack{i,l=1,\dots,n \\ i \ne l}} a_{il}^2 \]

we obtain that

\[ a_{jk}^2 \ge \frac{[N(A)]^2}{n(n-1)} \]

for the nondiagonal element $a_{jk}$ with largest modulus. Hence, from Lemma 7.13 we deduce that

\[ [N(B)]^2 = [N(A)]^2 - 2a_{jk}^2 \le q^2 [N(A)]^2, \]

where

\[ q := \Bigl( 1 - \frac{2}{n(n-1)} \Bigr)^{1/2}. \]

For the sequence $(A_\nu)$ this implies that

\[ N(A_\nu) \le q^\nu N(A_0) \]

for all $\nu \in \mathbb{N}$, whence $N(A_\nu) \to 0$, $\nu \to \infty$, since $q < 1$. □
Note that for large $n$ the value of $q$ is close to one, indicating a slow convergence of the Jacobi method. Writing $A_\nu = (a_{jk,\nu})$, by Corollary 7.6 we have the a posteriori error estimate

\[ |\lambda_j - a_{jj,\nu}| \le N(A_\nu), \quad j = 1, \dots, n, \]

after performing $\nu$ steps of the Jacobi method. Further error estimates can be derived from Gerschgorin's Theorem 7.7.

Approximations to the eigenvectors can be obtained by successively multiplying the unitary transformations of each step. We have $A_\nu = Q_\nu^* A Q_\nu$, where $Q_\nu = U_1 \cdots U_\nu$ is the product of the elementary unitary transformations for each step. From

\[ A_\nu \approx D = \operatorname{diag}(\lambda_1, \dots, \lambda_n) \]

it follows that $AQ_\nu \approx Q_\nu D$. Hence the columns $Q_\nu = (u_1, \dots, u_n)$ of $Q_\nu$ satisfy $Au_j \approx \lambda_j u_j$ for $j = 1, \dots, n$; i.e., they provide approximations to the eigenvectors.
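The classical Jacobi method, including the accumulation of the rotations for the eigenvectors, might be sketched as follows. NumPy is assumed; the function name `classical_jacobi`, the stopping tolerance `tol`, the step limit, and the test matrix are illustrative assumptions rather than part of the text. Each step forms the full product $U^*AU$ for clarity, whereas an efficient implementation would update only the two affected rows and columns.

```python
import numpy as np

def classical_jacobi(A, tol=1e-12, max_steps=10_000):
    """Classical Jacobi method for a real symmetric matrix.

    Returns the diagonal of the final iterate (approximate eigenvalues) and
    the accumulated rotations Q, whose columns approximate the eigenvectors.
    """
    A = np.array(A, dtype=float)
    n = A.shape[0]
    Q = np.eye(n)
    for _ in range(max_steps):
        off = np.abs(A - np.diag(np.diag(A)))
        j, k = np.unravel_index(np.argmax(off), off.shape)   # largest off-diagonal entry
        if off[j, k] <= tol:
            break
        # Rotation angle as in Lemma 7.13.
        if A[j, j] != A[k, k]:
            phi = 0.5 * np.arctan(2.0 * A[j, k] / (A[j, j] - A[k, k]))
        else:
            phi = np.pi / 4
        U = np.eye(n)
        U[j, j] = U[k, k] = np.cos(phi)
        U[k, j] = np.sin(phi)
        U[j, k] = -np.sin(phi)
        A = U.T @ A @ U
        Q = Q @ U
    return np.diag(A).copy(), Q

A0 = np.array([[2.0, -1.0, 0.0],
               [-1.0, 2.0, -1.0],
               [0.0, -1.0, 2.0]])
lam, Q = classical_jacobi(A0)
print(np.sort(lam))                             # approximations to 2 - sqrt(2), 2, 2 + sqrt(2)
print(np.linalg.norm(A0 @ Q - Q @ np.diag(lam)))  # residual of A Q = Q D
```

When several off-diagonal entries share the largest modulus, `argmax` picks the first one it meets, so intermediate matrices may differ from a hand computation that breaks ties differently; the limit is unaffected.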

In each step, the classical Jacobi method requires the determination of 
the nondiagonal element with largest modulus. In order to reduce the com- 
putational costs, in the cyclic Jacobi method the nondiagonal elements are 
annihilated in the order 

\[ (1,2), \dots, (1,n), (2,3), \dots, (2,n), (3,4), \dots, (n-1,n) \]

independent of their size. Convergence results can also be established for 
this variant (see [27]). 
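A cyclic sweep can be sketched in the same spirit. The code below assumes NumPy and 0-based indices; the name `cyclic_jacobi`, the tolerance, the sweep limit, and the $4 \times 4$ test matrix are hypothetical choices for illustration. Each rotation is applied by updating only columns and rows $j$ and $k$, and no search for the largest element is needed.

```python
import numpy as np

def cyclic_jacobi(A, tol=1e-12, max_sweeps=50):
    """Cyclic Jacobi method: annihilate (1,2), ..., (1,n), (2,3), ..., (n-1,n) in turn."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    sweeps = 0
    while np.linalg.norm(A - np.diag(np.diag(A))) > tol and sweeps < max_sweeps:
        for j in range(n - 1):
            for k in range(j + 1, n):
                if A[j, k] == 0.0:
                    continue                      # nothing to annihilate
                if A[j, j] != A[k, k]:
                    phi = 0.5 * np.arctan(2.0 * A[j, k] / (A[j, j] - A[k, k]))
                else:
                    phi = np.pi / 4
                c, s = np.cos(phi), np.sin(phi)
                G = np.array([[c, -s], [s, c]])   # 2x2 block of U in rows/columns j, k
                A[:, [j, k]] = A[:, [j, k]] @ G   # A <- A U
                A[[j, k], :] = G.T @ A[[j, k], :] # A <- U^T A
        sweeps += 1
    return np.diag(A).copy(), sweeps

M = np.array([[4.0, 1.0, 2.0, 0.5],
              [1.0, 3.0, 0.0, 1.0],
              [2.0, 0.0, 1.0, 0.3],
              [0.5, 1.0, 0.3, 2.0]])
lam, sweeps = cyclic_jacobi(M)
print(np.sort(lam), sweeps)
```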

A further refinement is to choose a constant threshold and to annihilate 
in each cyclic sweep only those off-diagonal elements that are larger in 
absolute value than the threshold. Of course, the threshold needs to be 
lowered after each sweep, i.e., after performing a full cycle. For details we 
refer to [48, 65]. 

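Example 7.15 For the matrix

\[ A = \begin{pmatrix} 2 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 2 \end{pmatrix} \]

the first six transformed matrices for the classical Jacobi method are given by

\[ A_1 = \begin{pmatrix} 1.0000 & 0.0000 & -0.7071 \\ 0.0000 & 3.0000 & -0.7071 \\ -0.7071 & -0.7071 & 2.0000 \end{pmatrix}, \]

\[ A_2 = \begin{pmatrix} 0.6340 & -0.3251 & 0.0000 \\ -0.3251 & 3.0000 & -0.6280 \\ 0.0000 & -0.6280 & 2.3660 \end{pmatrix}, \]

\[ A_3 = \begin{pmatrix} 0.6340 & -0.2768 & -0.1704 \\ -0.2768 & 3.3864 & 0.0000 \\ -0.1704 & 0.0000 & 1.9796 \end{pmatrix}, \]

\[ A_4 = \begin{pmatrix} 0.6064 & 0.0000 & -0.1695 \\ 0.0000 & 3.4140 & 0.0169 \\ -0.1695 & 0.0169 & 1.9796 \end{pmatrix}, \]

\[ A_5 = \begin{pmatrix} 0.5858 & 0.0020 & 0.0000 \\ 0.0020 & 3.4140 & 0.0168 \\ 0.0000 & 0.0168 & 2.0002 \end{pmatrix}, \]

\[ A_6 = \begin{pmatrix} 0.5858 & 0.0020 & -0.0000 \\ 0.0020 & 3.4142 & 0.0000 \\ -0.0000 & 0.0000 & 2.0000 \end{pmatrix}. \]

The exact eigenvalues of $A$ are $\lambda_1 = 2 + \sqrt{2}$, $\lambda_2 = 2$, $\lambda_3 = 2 - \sqrt{2}$. □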



7.4 The QR Algorithm 

The QR algorithm was suggested by Francis in 1961 and is an iterative 
method for computing all eigenvalues and eigenvectors for arbitrary com- 
plex matrices. In applications, it is the most commonly used method for 
eigenvalue computations. Our presentation of the QR algorithm follows 
[62]. 

For motivation we first consider the power method introduced by von 
Mises in 1929 for finding the eigenvalue with largest modulus. 

Definition 7.16 A matrix $A$ is called diagonalizable if there exists a nonsingular matrix $C$ such that $C^{-1}AC$ is a diagonal matrix; i.e., $A$ is similar to a diagonal matrix.

Theorem 7.17 An n x n matrix A is diagonalizable if and only if it has 
n linearly independent eigenvectors. 

Proof. Assume that $C^{-1}AC = D$, where $D = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$, is diagonal. Then $De_j = \lambda_j e_j$, $j = 1, \dots, n$, with the canonical orthonormal basis $e_1, \dots, e_n$ of $\mathbb{C}^n$. This implies that the vectors $x_j := Ce_j$, $j = 1, \dots, n$, are eigenvectors of $A$, since

\[ Ax_j = ACe_j = CDe_j = C\lambda_j e_j = \lambda_j x_j. \]

The vectors $x_1, \dots, x_n$ are linearly independent because $C$ is nonsingular and the $e_1, \dots, e_n$ are linearly independent.

Conversely, assume that $x_1, \dots, x_n$ are $n$ linearly independent eigenvectors of $A$ for the eigenvalues $\lambda_1, \dots, \lambda_n$. Then the matrix $C = (x_1, \dots, x_n)$ formed by the eigenvectors as columns is nonsingular, and we have that

\[ AC = (Ax_1, \dots, Ax_n) = (\lambda_1 x_1, \dots, \lambda_n x_n) = CD, \]

where $D = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$. Hence $C^{-1}AC = D$. □

We order the eigenvalues of a diagonalizable $n \times n$ matrix $A$ according to their absolute values and assume that

\[ |\lambda_1| > |\lambda_2| \ge |\lambda_3| \ge \cdots \ge |\lambda_n|; \]

i.e., there is only one eigenvalue of maximal modulus. Starting from an arbitrary vector $v_0 \in \mathbb{C}^n$ we construct the sequence

\[ v_\nu = A^\nu v_0, \quad \nu = 1, 2, \dots, \]

by the successive iterations $v_\nu := Av_{\nu-1}$. Note that in order to avoid numerical overflow or underflow we need to scale after each step. Since the $n$ linearly independent eigenvectors $x_1, \dots, x_n$ of $A$ form a basis of $\mathbb{C}^n$, we can represent

\[ v_0 = \sum_{k=1}^{n} \alpha_k x_k, \]

whence

\[ A^\nu v_0 = \sum_{k=1}^{n} \alpha_k \lambda_k^\nu x_k \]

follows. Scaling after each step by the factor $1/\lambda_1$ leads to

\[ \frac{A^\nu v_0}{\lambda_1^\nu} = \alpha_1 x_1 + \sum_{k=2}^{n} \alpha_k \Bigl( \frac{\lambda_k}{\lambda_1} \Bigr)^{\nu} x_k, \]

and consequently

\[ \frac{A^\nu v_0}{\lambda_1^\nu} \to \alpha_1 x_1 \quad \text{and} \quad \frac{\|v_{\nu+1}\|_2}{\|v_\nu\|_2} \to |\lambda_1| \]

as $\nu \to \infty$, provided that $\alpha_1 \ne 0$. Of course, in principle, $\lambda_1$ cannot be used as a scaling factor, since it is not known. However, this is irrelevant, since the eigenvector is determined only up to multiplication by a complex constant; i.e., only the direction of the eigenvector is relevant. In practical computations, the condition $\alpha_1 \ne 0$, i.e., $v_0 \notin \operatorname{span}\{x_2, \dots, x_n\}$, will be automatically satisfied through roundoff errors.
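The power method with scaling after each step may be sketched as follows. NumPy is assumed; the function name `power_method`, the fixed number of steps, and the $2 \times 2$ test matrix are illustrative assumptions. Since the exact eigenvalue is not available as a scaling factor, the sketch normalizes by the Euclidean norm, which only changes the length and not the direction of the iterates.

```python
import numpy as np

def power_method(A, v0, steps=200):
    """Power method: returns an estimate of |lambda_1| and a unit vector near span{x_1}."""
    v = np.asarray(v0, dtype=complex)
    v = v / np.linalg.norm(v)
    growth = 0.0
    for _ in range(steps):
        w = A @ v
        growth = np.linalg.norm(w)   # ||A v||_2 / ||v||_2 with ||v||_2 = 1
        v = w / growth               # scaling avoids overflow and underflow
    return growth, v

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])           # eigenvalues (5 +- sqrt(5)) / 2
growth, v = power_method(A, [1.0, 0.0])
rayleigh = np.vdot(v, A @ v)         # v^* A v, an eigenvalue estimate including the sign
print(growth, rayleigh.real)
```

The sketch does not guard against the iterate becoming the zero vector, which can only happen if $A$ is singular and the start vector is chosen unluckily.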
The fact that we need to find only the direction of the eigenvectors motivates us to interpret the power method as a successive iteration of subspaces. For

\[ S := \operatorname{span}\{v_0\} \quad \text{and} \quad A^\nu S = \operatorname{span}\{A^\nu v_0\} \]

from the above we have that $A^\nu S \to \operatorname{span}\{x_1\}$, $\nu \to \infty$. More generally, we can choose any subspace $S$ of dimension $1 \le \dim S < n$ and iterate $A^\nu S = \{A^\nu v : v \in S\}$.

Lemma 7.18 Let $A$ be a diagonalizable $n \times n$ matrix with eigenvalues

\[ |\lambda_1| \ge |\lambda_2| \ge \cdots \ge |\lambda_n| \]

and corresponding eigenvectors $x_1, x_2, \dots, x_n$. Assume that for some $m$ with $1 \le m < n$ we have that $|\lambda_m| > |\lambda_{m+1}|$ and define

\[ T := \operatorname{span}\{x_1, \dots, x_m\} \quad \text{and} \quad U := \operatorname{span}\{x_{m+1}, \dots, x_n\}. \]

Further, assume that $S$ is a subspace of $\mathbb{C}^n$ with dimension $m$ satisfying

\[ S \cap U = \{0\}. \]

Then the orthogonal projections $P_{A^\nu S}$ and $P_T$ of $\mathbb{C}^n$ onto $A^\nu S$ and $T$, respectively, satisfy

\[ \|P_{A^\nu S} - P_T\|_2 \le M \Bigl| \frac{\lambda_{m+1}}{\lambda_m} \Bigr|^{\nu}, \quad \nu \in \mathbb{N}, \]

for some constant $M$; i.e., the subspaces $A^\nu S$ converge to $T$.



Proof. 1. First, we show that we can choose a convenient basis for $S$. Let $y_1, \dots, y_m$ denote a given basis of $S$. Then, for $j = 1, \dots, m$, we can represent

\[ y_j = \sum_{k=1}^{m} b_{jk} x_k + v_j, \tag{7.5} \]

where $v_j \in U$. We prove that the $m \times m$ matrix $B = (b_{jk})$ is nonsingular. To accomplish this, assume that $\alpha_1, \dots, \alpha_m$ solve the homogeneous adjoint system

\[ \sum_{j=1}^{m} b_{jk} \alpha_j = 0, \quad k = 1, \dots, m. \]

Then from (7.5) it follows that

\[ \sum_{j=1}^{m} \alpha_j y_j = \sum_{j=1}^{m} \alpha_j v_j, \]

and from this, with the aid of $S \cap U = \{0\}$ and the linear independence of the $y_j$, we conclude that $\alpha_1 = \cdots = \alpha_m = 0$. Hence, $B$ indeed is nonsingular. We denote the entries of the inverse of $B$ by $B^{-1} = (c_{jk})$. Then

\[ z_j := \sum_{k=1}^{m} c_{jk} y_k, \quad j = 1, \dots, m, \]

defines a new basis for $S$ of the form

\[ z_j = x_j + u_j, \]

where $u_j \in U$ for $j = 1, \dots, m$. Because of

\[ A^\nu z_j = \lambda_j^\nu x_j + A^\nu u_j, \quad j = 1, \dots, m, \]

the linearly independent vectors

\[ w_{j\nu} := \frac{A^\nu z_j}{\lambda_j^\nu} = x_j + \frac{A^\nu u_j}{\lambda_j^\nu}, \quad j = 1, \dots, m, \]

form a basis of $A^\nu S$. Since we can represent any $u \in U$ in the form

\[ u = \sum_{k=m+1}^{n} \alpha_k x_k, \]

from

\[ A^\nu u = \sum_{k=m+1}^{n} \alpha_k \lambda_k^\nu x_k \]

we conclude that there exists a constant $L > 0$ such that

\[ \|w_{j\nu} - x_j\|_2 \le L \Bigl| \frac{\lambda_{m+1}}{\lambda_m} \Bigr|^{\nu}, \quad j = 1, \dots, m, \tag{7.6} \]

for all $\nu \in \mathbb{N}$.

2. By Corollary 3.53, the orthogonal projection of an element $\eta \in \mathbb{C}^n$ onto the subspace $T$ is given by

\[ P_T \eta = \sum_{k=1}^{m} \alpha_k x_k, \tag{7.7} \]

where the coefficients $\alpha_1, \dots, \alpha_m$ solve the normal equations

\[ \sum_{k=1}^{m} \alpha_k (x_k, x_j) = (\eta, x_j), \quad j = 1, \dots, m. \tag{7.8} \]

Analogously, we have

\[ P_{A^\nu S} \eta = \sum_{k=1}^{m} \beta_{k\nu} w_{k\nu} \tag{7.9} \]

and

\[ \sum_{k=1}^{m} \beta_{k\nu} (w_{k\nu}, w_{j\nu}) = (\eta, w_{j\nu}), \quad j = 1, \dots, m. \tag{7.10} \]

We denote the $m \times m$ matrices of the linear systems (7.8) and (7.10) by $X$ and $W_\nu$, respectively. Then, with the aid of the Cauchy-Schwarz inequality, (7.6) implies that

\[ \|W_\nu - X\|_2 \le C_1 \Bigl| \frac{\lambda_{m+1}}{\lambda_m} \Bigr|^{\nu}, \quad \nu \in \mathbb{N}, \tag{7.11} \]

for some constant $C_1$. We denote the right-hand sides of (7.8) and (7.10) by $a$ and $b_\nu$, respectively. Again from (7.6) and the Cauchy-Schwarz inequality we have that

\[ \|b_\nu - a\|_2 \le C_2 \Bigl| \frac{\lambda_{m+1}}{\lambda_m} \Bigr|^{\nu} \|\eta\|_2, \quad \nu \in \mathbb{N}, \tag{7.12} \]

for some constant $C_2$. Now, considering the linear system (7.10) as a perturbation of (7.8), from Theorem 5.3 we can conclude that

\[ \|\beta_\nu - \alpha\|_2 \le C_3 \Bigl| \frac{\lambda_{m+1}}{\lambda_m} \Bigr|^{\nu} \|\eta\|_2, \quad \nu \in \mathbb{N}, \tag{7.13} \]

for the vectors $\alpha = (\alpha_1, \dots, \alpha_m)^T$ and $\beta_\nu = (\beta_{1\nu}, \dots, \beta_{m\nu})^T$ and some constant $C_3$. From (7.7) and (7.9), using (7.6), (7.13), and the triangle inequality, the assertion of the lemma follows. □



The subspace $T$ of Lemma 7.18 is invariant with respect to $A$; i.e., $A(T) = T$. By a knowledge of invariant subspaces the eigenvalue problem for the full matrix $A$ can be reduced to eigenvalue problems for two smaller matrices. Assume that

\[ P = (P_1, P_2) \]

is a unitary matrix such that its first $m$ columns represented by the matrix $P_1$ form a basis of $T$. Then $P_2^* A P_1 = 0$, since $T$ is invariant with respect to $A$, and $P_2^* P_1 = 0$. Therefore, the unitary transformation yields

\[ P^*AP = \begin{pmatrix} P_1^* A P_1 & P_1^* A P_2 \\ P_2^* A P_1 & P_2^* A P_2 \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{pmatrix}; \]

i.e., the eigenvalue problem for $A$ is reduced to two smaller eigenvalue problems for the $m \times m$ matrix $A_{11}$ and the $(n-m) \times (n-m)$ matrix $A_{22}$.

The successive iterations of Lemma 7.18 yield only approximations $A^\nu S$ to the invariant subspace $T$. However, if

\[ Q_\nu = (Q_{1\nu}, Q_{2\nu}) \]

denotes a unitary matrix such that its first $m$ columns represented by the matrix $Q_{1\nu}$ form a basis of $A^\nu S$, then for

\[ Q_\nu^* A Q_\nu = \begin{pmatrix} B_{11,\nu} & B_{12,\nu} \\ B_{21,\nu} & B_{22,\nu} \end{pmatrix} \]

we expect that $B_{21,\nu} \to 0$, $\nu \to \infty$. Before we can establish this result we need to investigate further the iteration of subspaces.

Choose a basis $y_1, \dots, y_n$ of $\mathbb{C}^n$ and consider the subspaces

\[ S_m := \operatorname{span}\{y_1, \dots, y_m\}, \quad m = 1, \dots, n-1. \]

For a simultaneous iteration of all the subspaces $A^\nu S_m$ it clearly suffices to iterate the basis vectors $A^\nu y_1, \dots, A^\nu y_n$. If the assumptions of Lemma 7.18 are satisfied for each $m = 1, \dots, n-1$, then

\[ A^\nu S_m \to T_m := \operatorname{span}\{x_1, \dots, x_m\}, \quad \nu \to \infty, \]

for $m = 1, \dots, n-1$. Hence we expect to be able to construct unitary matrices $Q_\nu$ such that $Q_\nu^* A Q_\nu \to R$, $\nu \to \infty$, where $R$ is an upper triangular matrix that is similar to $A$.

For the actual computation two difficulties arise. Firstly, the iterated vectors have to be scaled in order to avoid numerical overflow or underflow. Secondly, by Theorem 7.17, as $\nu \to \infty$ each of the $n$ sequences $(A^\nu y_1), \dots, (A^\nu y_n)$ will converge to the subspace $\operatorname{span}\{x_1\}$ spanned by the eigenvector for the eigenvalue $\lambda_1$ with largest modulus. Hence, for large $\nu$ the vectors $A^\nu y_1, \dots, A^\nu y_n$ will be almost collinear; i.e., the basis elements $A^\nu y_1, \dots, A^\nu y_n$ are almost linearly dependent and therefore ill-conditioned for spanning the iterated subspaces.

Both these difficulties can be remedied by orthonormalizing the basis after each step. Assume that $q_{1\nu}, \dots, q_{n\nu}$ are orthonormal vectors such that

\[ A^\nu S_m = \operatorname{span}\{q_{1\nu}, \dots, q_{m\nu}\}, \quad m = 1, \dots, n-1. \]

Then we compute $Aq_{1\nu}, \dots, Aq_{n\nu}$ and orthonormalize these vectors from left to right to obtain the vectors $q_{1,\nu+1}, \dots, q_{n,\nu+1}$. This procedure preserves the property

\[
\begin{aligned}
\operatorname{span}\{q_{1,\nu+1}, \dots, q_{m,\nu+1}\} &= \operatorname{span}\{Aq_{1\nu}, \dots, Aq_{m\nu}\} \\
&= A(\operatorname{span}\{q_{1\nu}, \dots, q_{m\nu}\}) = A^{\nu+1} S_m
\end{aligned}
\]

for $m = 1, \dots, n-1$.
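The iteration of nested subspaces with orthonormalization after each step can be sketched in a few lines. NumPy is assumed; the name `orthogonal_iteration`, the fixed number of steps, and the test matrix are illustrative. The sketch starts from the canonical basis and uses `numpy.linalg.qr` for the orthonormalization, which yields the same nested spans as orthonormalizing from left to right (the individual vectors may differ by factors of modulus one).

```python
import numpy as np

def orthogonal_iteration(A, steps=200):
    """Iterate the nested subspaces span{q_1, ..., q_m} and return (Q^* A Q, Q)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    Q = np.eye(n)                       # canonical basis as starting basis
    for _ in range(steps):
        Q, _ = np.linalg.qr(A @ Q)      # orthonormal basis of the spans of A q_1, ..., A q_m
    return Q.conj().T @ A @ Q, Q

A = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])
T, Q = orthogonal_iteration(A)
print(np.round(T, 6))   # entries below the diagonal tend to zero
```

For this symmetric example the limit is diagonal, with the eigenvalues appearing in order of decreasing modulus.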

Theorem 7.19 Assume that $A$ is a diagonalizable $n \times n$ matrix with eigenvalues

\[ |\lambda_1| > |\lambda_2| > \cdots > |\lambda_n| \]

and corresponding eigenvectors $x_1, x_2, \dots, x_n$, and set

\[ T_m := \operatorname{span}\{x_1, \dots, x_m\} \quad \text{and} \quad U_m := \operatorname{span}\{x_{m+1}, \dots, x_n\} \]

for $m = 1, \dots, n-1$. Let $q_{10}, \dots, q_{n0}$ be an orthonormal basis of $\mathbb{C}^n$ and let the subspaces

\[ S_m := \operatorname{span}\{q_{10}, \dots, q_{m0}\} \]

satisfy

\[ S_m \cap U_m = \{0\}, \quad m = 1, \dots, n-1. \]

Assume that for each $\nu \in \mathbb{N}$ we have constructed an orthonormal system $q_{1\nu}, \dots, q_{n\nu}$ with the property

\[ A^\nu S_m = \operatorname{span}\{q_{1\nu}, \dots, q_{m\nu}\}, \quad m = 1, \dots, n-1, \tag{7.14} \]

and define $Q_\nu = (q_{1\nu}, \dots, q_{n\nu})$. Then for the sequence of matrices $A_\nu = (a_{jk,\nu})$ given by

\[ A_{\nu+1} := Q_\nu^* A Q_\nu \tag{7.15} \]

we have convergence:

\[ \lim_{\nu \to \infty} a_{jk,\nu} = 0, \quad 1 \le k < j \le n, \]

and

\[ \lim_{\nu \to \infty} a_{jj,\nu} = \lambda_j, \quad j = 1, \dots, n. \]

Proof. 1. Without loss of generality we may assume that $\|x_j\|_2 = 1$ for $j = 1, \dots, n$. From Lemma 7.18 it follows that

\[ \|P_{A^\nu S_m} - P_{T_m}\|_2 \le M r^\nu, \quad m = 1, \dots, n-1, \quad \nu \in \mathbb{N}, \tag{7.16} \]

for some constant $M$ and

\[ r := \max_{m=1,\dots,n-1} \Bigl| \frac{\lambda_{m+1}}{\lambda_m} \Bigr| < 1. \]

From this, for the projections

\[ w_{m\nu} := P_{A^\nu S_m} x_m, \quad m = 1, \dots, n-1, \]

and $w_{n\nu} := x_n$, we conclude that

\[ \|w_{m\nu} - x_m\|_2 \le M r^\nu, \quad m = 1, \dots, n, \quad \nu \in \mathbb{N}. \tag{7.17} \]

For sufficiently large $\nu$ the vectors $w_{1\nu}, \dots, w_{n\nu}$ are linearly independent, and we have that

\[ \operatorname{span}\{w_{1\nu}, \dots, w_{m\nu}\} = A^\nu S_m, \quad m = 1, \dots, n-1. \]
To prove this we assume to the contrary that the vectors $w_{1\nu}, \dots, w_{n\nu}$ are not linearly independent for all sufficiently large $\nu$. Then there exists a sequence $\nu_\ell$ such that the vectors $w_{1\nu_\ell}, \dots, w_{n\nu_\ell}$ are linearly dependent for each $\ell \in \mathbb{N}$. Hence there exist complex numbers $\alpha_{1\ell}, \dots, \alpha_{n\ell}$ such that

\[ \sum_{k=1}^{n} \alpha_{k\ell} w_{k\nu_\ell} = 0 \quad \text{and} \quad \sum_{k=1}^{n} |\alpha_{k\ell}|^2 = 1. \tag{7.18} \]

By the Bolzano-Weierstrass theorem, without loss of generality, we may assume that

\[ \alpha_{k\ell} \to \alpha_k, \quad \ell \to \infty, \quad k = 1, \dots, n. \]

Passing to the limit $\ell \to \infty$ in (7.18) with the aid of (7.17) now leads to

\[ \sum_{k=1}^{n} \alpha_k x_k = 0 \quad \text{and} \quad \sum_{k=1}^{n} |\alpha_k|^2 = 1, \]

which contradicts the linear independence of the eigenvectors $x_1, \dots, x_n$.
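2. We orthonormalize by setting $\tilde p_1 := x_1$ and

\[ \tilde p_m := x_m - P_{T_{m-1}} x_m, \quad m = 2, \dots, n, \]

\[ p_m := \frac{\tilde p_m}{\|\tilde p_m\|_2}, \quad m = 1, \dots, n, \]

and, analogously, $\tilde v_{1\nu} := w_{1\nu}$ and

\[ \tilde v_{m\nu} := w_{m\nu} - P_{A^\nu S_{m-1}} w_{m\nu}, \quad m = 2, \dots, n, \]

\[ v_{m\nu} := \frac{\tilde v_{m\nu}}{\|\tilde v_{m\nu}\|_2}, \quad m = 1, \dots, n. \]

Then

\[ \operatorname{span}\{p_1, \dots, p_m\} = T_m, \quad m = 1, \dots, n-1, \]

and by repeating the above argument,

\[ \operatorname{span}\{v_{1\nu}, \dots, v_{m\nu}\} = A^\nu S_m, \quad m = 1, \dots, n-1, \tag{7.19} \]

for sufficiently large $\nu$. Writing

\[ \tilde p_m - \tilde v_{m\nu} = x_m - w_{m\nu} + (P_{A^\nu S_{m-1}} - P_{T_{m-1}}) x_m + P_{A^\nu S_{m-1}} (w_{m\nu} - x_m), \]

with the aid of (7.16) and (7.17) we obtain that

\[ \|\tilde v_{m\nu} - \tilde p_m\|_2 \le 3M r^\nu, \quad m = 1, \dots, n, \quad \nu \in \mathbb{N}. \]

From this and the representation

\[ v_{m\nu} - p_m = \frac{\tilde v_{m\nu}}{\|\tilde v_{m\nu}\|_2} \, \frac{\|\tilde p_m\|_2 - \|\tilde v_{m\nu}\|_2}{\|\tilde p_m\|_2} + \frac{\tilde v_{m\nu} - \tilde p_m}{\|\tilde p_m\|_2} \]

it follows that

\[ \|v_{m\nu} - p_m\|_2 \le C r^\nu, \quad m = 1, \dots, n, \quad \nu \in \mathbb{N}, \tag{7.20} \]

for some constant $C$.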

3. From (7.14) and (7.19), by induction, we deduce the existence of phase factors $\varphi_{m\nu} \in \mathbb{C}$ with $|\varphi_{m\nu}| = 1$ such that

\[ q_{m\nu} = \varphi_{m\nu} v_{m\nu}, \quad m = 1, \dots, n. \]

Therefore, defining the diagonal matrices $D_\nu = \operatorname{diag}(\varphi_{1\nu}, \dots, \varphi_{n\nu})$ and the unitary matrices $V_\nu = (v_{1\nu}, \dots, v_{n\nu})$, we have the relation

\[ Q_\nu = V_\nu D_\nu, \quad Q_\nu^* = D_\nu^* V_\nu^*. \]

This implies that

\[
\begin{aligned}
A_{\nu+1} &= Q_\nu^* A Q_\nu = D_\nu^* V_\nu^* A V_\nu D_\nu \\
&= D_\nu^* (V_\nu^* - P^*) A V_\nu D_\nu + D_\nu^* P^* A (V_\nu - P) D_\nu + D_\nu^* P^* A P D_\nu,
\end{aligned}
\tag{7.21}
\]

where $P = (p_1, \dots, p_n)$. Because of (7.20) we have that

\[ \|V_\nu - P\|_2 = \|V_\nu^* - P^*\|_2 \to 0, \quad \nu \to \infty. \]

Furthermore, $D_\nu^* P^* A P D_\nu$ is an upper triangular matrix with diagonal elements $\operatorname{diag}(\lambda_1, \dots, \lambda_n)$. Hence, the assertion of the theorem follows by passing to the limit $\nu \to \infty$ in (7.21). We note that for the elements above the diagonal we do not, in general, have convergence because of the occurrence of the phase factors. □

For the actual numerical implementation we have to describe the computation of $A_{\nu+1}$ according to (7.15). From page 20 we recall that orthonormalizing $n$ vectors $a_1, \dots, a_n$ from left to right is equivalent to determining orthonormal vectors $q_1, \dots, q_n$ and an upper triangular matrix $R = (r_{jk})$ such that

\[ a_k = \sum_{i=1}^{k} r_{ik} q_i, \quad k = 1, \dots, n. \]

For the matrices $A = (a_1, \dots, a_n)$ and $Q = (q_1, \dots, q_n)$ this corresponds to a QR decomposition

\[ A = QR \]

as described in detail in Section 2.4. Now assume that $A_\nu = Q_{\nu-1}^* A Q_{\nu-1}$ has been determined according to (7.15). To generate $A_{\nu+1}$ from this, a QR decomposition of the matrix $AQ_{\nu-1}$ is required, since

\[ A^\nu S_m = A A^{\nu-1} S_m = \operatorname{span}\{Aq_{1,\nu-1}, \dots, Aq_{m,\nu-1}\}. \]
This is obtained from a QR decomposition

\[ A_\nu = \tilde Q_\nu R_\nu \tag{7.22} \]

of $A_\nu$ by

\[ AQ_{\nu-1} = Q_{\nu-1} A_\nu = Q_{\nu-1} \tilde Q_\nu R_\nu = Q_\nu R_\nu, \]

where $Q_\nu = Q_{\nu-1} \tilde Q_\nu$. From this we find that

\[ A_{\nu+1} = Q_\nu^* A Q_\nu = \tilde Q_\nu^* A_\nu \tilde Q_\nu = R_\nu \tilde Q_\nu. \tag{7.23} \]

Hence the two equations (7.22) and (7.23) represent one step of the successive iterations of subspaces as described in Theorem 7.19.

Now the QR algorithm consists in performing these iterations starting from the canonical basis $e_1, \dots, e_n$, which means that in the first step a QR decomposition is required for $A_1 = A = (Ae_1, \dots, Ae_n)$.

Theorem 7.20 (QR algorithm) Let $A$ be a diagonalizable matrix with eigenvalues

\[ |\lambda_1| > |\lambda_2| > \cdots > |\lambda_n| \]

and corresponding eigenvectors $x_1, x_2, \dots, x_n$, and assume that

\[ \operatorname{span}\{e_1, \dots, e_m\} \cap \operatorname{span}\{x_{m+1}, \dots, x_n\} = \{0\} \tag{7.24} \]

for $m = 1, \dots, n-1$. Starting with $A_1 = A$, construct a sequence $(A_\nu)$ by determining a QR decomposition

\[ A_\nu = Q_\nu R_\nu \]

and setting

\[ A_{\nu+1} := R_\nu Q_\nu \]

for $\nu = 1, 2, \dots$. Then for $A_\nu = (a_{jk,\nu})$ we have convergence:

\[ \lim_{\nu \to \infty} a_{jk,\nu} = 0, \quad 1 \le k < j \le n, \]

and

\[ \lim_{\nu \to \infty} a_{jj,\nu} = \lambda_j, \quad j = 1, \dots, n. \]

Proof. This is just a special case of Theorem 7.19. □
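The basic (unshifted) QR algorithm of Theorem 7.20 can be sketched as follows. NumPy is assumed; the function name `qr_algorithm`, the fixed number of steps, and the test matrix are illustrative choices. The test matrix was constructed for this sketch with eigenvalues 5, 2, 1 and eigenvectors $(1,0,1)^T$, $(1,1,0)^T$, $(0,1,1)^T$, for which the assumption (7.24) holds.

```python
import numpy as np

def qr_algorithm(A, steps=200):
    """Basic QR algorithm: factor A_nu = Q_nu R_nu and set A_{nu+1} = R_nu Q_nu."""
    Ak = np.array(A, dtype=float)
    for _ in range(steps):
        Q, R = np.linalg.qr(Ak)
        Ak = R @ Q                    # similar to Ak, since R Q = Q^* (Q R) Q
    return Ak

# Nonsymmetric test matrix with eigenvalues 5, 2, 1 (constructed for illustration).
A = np.array([[3.5, -1.5, 1.5],
              [0.5, 1.5, -0.5],
              [2.0, -2.0, 3.0]])
A_limit = qr_algorithm(A)
print(np.round(A_limit, 6))   # lower triangle near zero, diagonal near 5, 2, 1
```

The entries above the diagonal need not converge, in line with the remark on phase factors at the end of the proof of Theorem 7.19; for real matrices the phase factors reduce to signs.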



We proceed with a discussion of the assumption (7.24). Define the matrices $X := (x_1, \dots, x_n)$ and $Y := X^{-1} = (y_{jk})$. Then the identity $I = XY$ means that

\[ e_j = \sum_{k=1}^{n} y_{kj} x_k, \quad j = 1, \dots, n. \]
For fixed $m = 1, \dots, n-1$ the property (7.24) holds if and only if

\[ \sum_{j=1}^{m} \alpha_j e_j \in \operatorname{span}\{x_{m+1}, \dots, x_n\} \]

implies that $\alpha_1 = \cdots = \alpha_m = 0$. This in turn is satisfied if and only if the homogeneous linear system

\[ \sum_{j=1}^{m} y_{kj} \alpha_j = 0, \quad k = 1, \dots, m, \]

admits only the trivial solution, since

\[ \sum_{j=1}^{m} \alpha_j e_j = \sum_{k=1}^{n} \Bigl( \sum_{j=1}^{m} y_{kj} \alpha_j \Bigr) x_k. \]

Hence (7.24) holds if and only if for $m = 1, \dots, n-1$ the $m \times m$ submatrices $(y_{kj})$, $k, j = 1, \dots, m$, are nonsingular. This means that for the matrix $Y$, Gaussian elimination works without interchanging columns; i.e., the matrix $Y$ has an LR decomposition. Since Gaussian elimination with column pivoting always works, there exists a permutation matrix $P$ such that we have an LR decomposition $PY = LR$ (see Problem 2.16). Hence it is plausible that the assumption (7.24) is not very restrictive. Indeed, it can be shown that convergence of the QR algorithm also holds when (7.24) is not satisfied. However, in general, the eigenvalues on the diagonal will not occur ordered according to their size (see [65]). Furthermore, it can be shown that in the case of eigenvalues with the same modulus, the QR algorithm still works in the sense of an appropriately modified version of Theorem 7.20. For example, for two conjugate complex eigenvalues, the upper triangular matrix will be distorted through a two-by-two block on the diagonal. The blocks do not converge, but still the conjugate complex eigenvalues can be obtained as eigenvalues of the individual two-by-two blocks (see [65]).

In principle, the QR decomposition required in each step of the QR 
algorithm can be done through the Gram-Schmidt procedure. However, in 
practice, because of the ill-conditioning of the Gram-Schmidt procedure, 
orthogonalizing by Householder transformations is preferable. For details 
we refer back to Section 2.4. 

The basic form of the QR algorithm as described above is not yet efficient enough for applications, since each iteration step requires $O(n^3)$ operations. The speed of convergence is determined by the location of the eigenvalues with respect to one another. The matrix $A - aI$ has the eigenvalues $\lambda_j - a$ for $j = 1, \dots, n$. If we choose for $a$ an approximate value of the eigenvalue $\lambda_n$ of smallest absolute value, then $\lambda_n - a$ becomes small. This will speed up the convergence in the last row of the matrix, since

\[ \Bigl| \frac{\lambda_n - a}{\lambda_{n-1} - a} \Bigr| \ll 1. \]

Having reduced the elements of the last row to almost zero, the last row and column of the matrix may be neglected. This means that the smallest eigenvalue is deflated by canceling the last row and column, and the same procedure can be applied to the remaining $(n-1) \times (n-1)$ matrix with the parameter $a$ changed to be close to $\lambda_{n-1}$. This so-called shift and deflation strategy leads to a tremendous speeding up of the convergence. For details we refer to [27, 65].
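A shift and deflation strategy for real symmetric matrices might look as follows. This is only a sketch under stated assumptions: NumPy is used, the shift is taken to be the current last diagonal entry (one simple way to obtain an approximation to an eigenvalue, not prescribed by the text), and the name `shifted_qr_eigenvalues`, the tolerance, and the test matrix are illustrative. With this simple shift the iteration can stall for special matrices, for instance when the shifted matrix has pairs of eigenvalues of equal modulus; practical codes use more refined shifts.

```python
import numpy as np

def shifted_qr_eigenvalues(A, tol=1e-12, max_iter=1000):
    """QR steps with shift a = last diagonal entry and deflation (real symmetric A)."""
    A = np.array(A, dtype=float)
    eigenvalues = []
    iterations = 0
    while A.shape[0] > 1:
        m = A.shape[0]
        if np.linalg.norm(A[m - 1, :m - 1]) <= tol:
            eigenvalues.append(A[m - 1, m - 1])   # last row is negligible: deflate
            A = A[:m - 1, :m - 1]
            continue
        if iterations >= max_iter:
            raise RuntimeError("no convergence within max_iter QR steps")
        shift = A[m - 1, m - 1]
        Q, R = np.linalg.qr(A - shift * np.eye(m))
        A = R @ Q + shift * np.eye(m)             # similar to the previous matrix
        iterations += 1
    eigenvalues.append(A[0, 0])
    return np.array(eigenvalues), iterations

M = np.array([[4.0, 1.0, 2.0, 0.5],
              [1.0, 3.0, 0.0, 1.0],
              [2.0, 0.0, 1.0, 0.3],
              [0.5, 1.0, 0.3, 2.0]])
lam, iterations = shifted_qr_eigenvalues(M)
print(np.sort(lam), iterations)
```

Comparing the iteration count with the number of steps the unshifted algorithm needs for the same accuracy gives a feeling for the speedup.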

The computational cost of one step of the QR algorithm is reduced when
the matrix has a large number of zero entries. For example, for tridiagonal 
matrices all matrices generated in the QR algorithm remain tridiagonal. In 
the following section we will consider so-called Hessenberg matrices, which 
differ from upper triangular matrices only by a non-zero first subdiagonal. It 
can be shown (see Problem 7.16) that the Hessenberg form is also invariant 
with respect to the QR algorithm. Hence, for practical computations it is 
convenient first to transform the matrix into Hessenberg form. 

In general, comparing the computational costs, for symmetric matrices 
the QR algorithm is superior to the Jacobi method. However, the actual 
programming for the Jacobi method is very simple as compared with the 
QR algorithm. Hence for small matrix size n the Jacobi method is still 
attractive. 



7.5 Hessenberg Matrices 

Definition 7.21 An $n \times n$ matrix $B = (b_{jk})$ is called a Hessenberg matrix if $b_{jk} = 0$ for $1 \le k \le j-2$, $j = 3, \dots, n$; i.e., in the lower triangular part of a Hessenberg matrix only the elements of the first subdiagonal can be different from zero.

We proceed by showing that each matrix $A$ can be transformed into Hessenberg form by unitary transformations using Householder matrices. We start with generating zeros in the first column by multiplying $A$ from the left by a Householder matrix $H_1$. We write

\[ A = \begin{pmatrix} a_{11} & * \\ a_1 & \tilde A \end{pmatrix}, \]

where $\tilde A$ is an $(n-1) \times (n-1)$ matrix and $a_1$ an $(n-1)$ vector. Then considering a Householder matrix $H_1$ of the form

\[ H_1 = \begin{pmatrix} 1 & 0 \\ 0 & \tilde H_1 \end{pmatrix}, \]

where $\tilde H_1 = I - 2v_1 v_1^*$ is an $(n-1) \times (n-1)$ Householder matrix, we have

\[ AH_1^* = \begin{pmatrix} a_{11} & * \\ a_1 & \tilde A \tilde H_1^* \end{pmatrix} \]

and

\[ H_1 A H_1^* = \begin{pmatrix} a_{11} & * \\ \tilde H_1 a_1 & \tilde H_1 \tilde A \tilde H_1^* \end{pmatrix}. \]

As shown in the proof of Theorem 2.13, choosing

\[ v_1 = \frac{u_1}{\sqrt{u_1^* u_1}}, \]

where

\[ u_1 = a_1 \mp \sigma (1, 0, \dots, 0)^T \]

and

\[ \sigma = \begin{cases} \dfrac{a_{21}}{|a_{21}|} \sqrt{a_1^* a_1}, & a_{21} \ne 0, \\[2ex] \sqrt{a_1^* a_1}, & a_{21} = 0, \end{cases} \]

eliminates all elements of $a_1$ with the exception of the first component. Hence the first column of the transformed matrix is of the required form. Now assume that $A_k$ is an $n \times n$ matrix of the form

\[ A_k = \begin{pmatrix} B_k & * \\ 0 \;\; a_k & \tilde A_{n-k} \end{pmatrix}, \]

where $B_k$ is a $k \times k$ Hessenberg matrix, $\tilde A_{n-k}$ an $(n-k) \times (n-k)$ matrix, $a_k$ an $(n-k)$ vector, and $0$ the $(n-k) \times (k-1)$ zero matrix. Then for a Householder transformation of the form

\[ H_k = \begin{pmatrix} I_k & 0 \\ 0 & \tilde H_{n-k} \end{pmatrix}, \]

where $I_k$ denotes the $k \times k$ identity matrix and $\tilde H_{n-k}$ is an $(n-k) \times (n-k)$ Householder matrix, it follows that

\[ A_k H_k^* = \begin{pmatrix} B_k & * \\ 0 \;\; a_k & \tilde A_{n-k} \tilde H_{n-k}^* \end{pmatrix} \]

and

\[ H_k A_k H_k^* = \begin{pmatrix} B_k & * \\ 0 \;\; \tilde H_{n-k} a_k & \tilde H_{n-k} \tilde A_{n-k} \tilde H_{n-k}^* \end{pmatrix}. \]

Now, proceeding as above, we can choose $\tilde H_{n-k}$ such that all elements of $\tilde H_{n-k} a_k$ vanish with the exception of the first component. This procedure reduces a further column into Hessenberg form. We can summarize our analysis in the following theorem.
Theorem 7.22 To each $n \times n$ matrix $A$ there exist $n-2$ Householder matrices $H_1, \dots, H_{n-2}$ such that for $Q = H_1 \cdots H_{n-2}$ the matrix

\[ B = Q^*AQ \]

is a Hessenberg matrix.
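For real matrices the reduction can be sketched as follows. NumPy is assumed; the name `hessenberg_reduce` and the test matrix are illustrative. The sketch uses the sign choice $u = a + \sigma e_1$ with $\sigma$ carrying the sign of the first component of $a$, which avoids cancellation, and it accumulates the product $H_1 H_2 \cdots H_{n-2}$ in `Q` so that the returned `B` equals `Q.T @ A @ Q`. Full matrix products are formed for clarity, not efficiency.

```python
import numpy as np

def hessenberg_reduce(A):
    """Reduce a real square matrix to Hessenberg form by Householder matrices."""
    B = np.array(A, dtype=float)
    n = B.shape[0]
    Q = np.eye(n)                              # accumulates H_1 H_2 ... H_{n-2}
    for k in range(n - 2):
        a = B[k + 1:, k]                       # part of column k below the diagonal
        norm_a = np.linalg.norm(a)
        if norm_a == 0.0:
            continue                           # column is already of the required form
        sigma = norm_a if a[0] >= 0 else -norm_a
        u = a.copy()
        u[0] += sigma                          # u = a + sigma e_1
        v = u / np.linalg.norm(u)
        H = np.eye(n)
        H[k + 1:, k + 1:] -= 2.0 * np.outer(v, v)
        B = H @ B @ H                          # H is symmetric and orthogonal
        Q = Q @ H
    return B, Q

A = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 6.0, 7.0, 8.0],
              [9.0, 10.0, 11.0, 13.0],
              [14.0, 15.0, 17.0, 19.0]])
B, Q = hessenberg_reduce(A)
print(np.round(B, 4))
```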

For a Hessenberg matrix the value of the characteristic polynomial and its derivative at a point $\lambda \in \mathbb{C}$ can be computed easily without computing the coefficients of the polynomial. These two quantities are required for employing Newton's method for approximating the eigenvalues as the zeros of the characteristic polynomial. We first consider the case of a symmetric Hessenberg matrix.

Example 7.23 Let

\[
A = \begin{pmatrix}
a_1 & c_2 & & & & \\
c_2 & a_2 & c_3 & & & \\
 & c_3 & a_3 & c_4 & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & c_{n-1} & a_{n-1} & c_n \\
 & & & & c_n & a_n
\end{pmatrix}
\]

be a symmetric tridiagonal matrix. Denote by $A_k$ the $k \times k$ submatrix consisting of the first $k$ rows and columns of $A$, and let $p_k$ denote the characteristic polynomial of $A_k$. Then we have the recurrence relations

\[ p_k(\lambda) = (a_k - \lambda) p_{k-1}(\lambda) - c_k^2 p_{k-2}(\lambda), \quad k = 2, \dots, n, \tag{7.25} \]

and

\[ p_k'(\lambda) = (a_k - \lambda) p_{k-1}'(\lambda) - c_k^2 p_{k-2}'(\lambda) - p_{k-1}(\lambda), \quad k = 2, \dots, n, \tag{7.26} \]

starting with $p_0(\lambda) = 1$ and $p_1(\lambda) = a_1 - \lambda$.

Proof. The recursion (7.25) follows by expanding $\det(A_k - \lambda I)$ with respect to the last column, and (7.26) is obtained by differentiating (7.25). □

Example 7.24 The $n \times n$ tridiagonal matrix

\[
A = \begin{pmatrix}
2 & -1 & & & \\
-1 & 2 & -1 & & \\
 & -1 & 2 & -1 & \\
 & & \ddots & \ddots & \ddots \\
 & & & -1 & 2
\end{pmatrix}
\]

has the eigenvalues

\[ \lambda_j = 4 \sin^2 \frac{j\pi}{2(n+1)}, \quad j = 1, \dots, n \]

(see Example 4.17). Table 7.1 gives the results of the Newton iteration using (7.25) and (7.26) for computing the smallest eigenvalue $\lambda_{\min} = \lambda_1$ and the largest eigenvalue $\lambda_{\max} = \lambda_n$ for $n = 10$. The starting values are obtained from the Gerschgorin estimates $|\lambda - 2| \le 2$ following from Theorem 7.7. □

TABLE 7.1. Hessenberg method for Example 7.24

    λ_max          λ_min
    4.00000000     0.00000000
    3.95000000     0.05000000
    3.92542110     0.07457890
    3.91933549     0.08066451
    3.91898705     0.08101295
    3.91898595     0.08101405
    3.91898595     0.08101405
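The Newton iteration behind Table 7.1 can be sketched in plain Python. The function names `char_poly_and_derivative` and `newton_eigenvalue`, the storage convention for the off-diagonal entries, and the stopping rule are assumptions made for this sketch. The polynomial and its derivative are evaluated by the recurrences (7.25) and (7.26), and the iteration is started at the Gerschgorin bounds 4 and 0 for $n = 10$.

```python
import math

def char_poly_and_derivative(a, c, lam):
    """Evaluate p_n(lam) and p_n'(lam) by the recurrences (7.25) and (7.26).

    a[0..n-1] holds the diagonal a_1, ..., a_n; c[k] holds c_{k+1} for
    k = 1, ..., n-1 (c[0] is unused).
    """
    p_prev, p = 1.0, a[0] - lam        # p_0 and p_1
    dp_prev, dp = 0.0, -1.0            # p_0' and p_1'
    for k in range(1, len(a)):
        p_next = (a[k] - lam) * p - c[k] ** 2 * p_prev
        dp_next = (a[k] - lam) * dp - c[k] ** 2 * dp_prev - p
        p_prev, p = p, p_next
        dp_prev, dp = dp, dp_next
    return p, dp

def newton_eigenvalue(a, c, lam0, tol=1e-12, max_iter=100):
    """Newton's method for a zero of the characteristic polynomial."""
    lam = lam0
    for _ in range(max_iter):
        p, dp = char_poly_and_derivative(a, c, lam)
        step = p / dp
        lam -= step
        if abs(step) <= tol:
            break
    return lam

n = 10
a = [2.0] * n
c = [0.0] + [-1.0] * (n - 1)
lam_max = newton_eigenvalue(a, c, 4.0)
lam_min = newton_eigenvalue(a, c, 0.0)
print(f"{lam_max:.8f} {lam_min:.8f}")
print(4 * math.sin(n * math.pi / (2 * (n + 1))) ** 2)   # exact largest eigenvalue
```

The sketch does not guard against a vanishing derivative; for the starting values used here the Newton iterates approach the extreme eigenvalues monotonically.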



We conclude this section by describing the computation of the quotient
of the value of the characteristic polynomial $p(\lambda) = \det(B - \lambda I)$ and its
derivative for a general Hessenberg matrix $B = (b_{jk})$. We assume that
$b_{j,j-1} \neq 0$ for $j = 2, \ldots, n$; i.e., $B$ is irreducible (see Problem 7.15). For a
given $\lambda$ we determine

$$\xi = \xi(\lambda) = (\xi_1, \ldots, \xi_n)^T$$

and $\alpha = \alpha(\lambda)$ such that

$$\begin{aligned}
(b_{11} - \lambda)\xi_1 + b_{12}\xi_2 + \cdots + b_{1n}\xi_n &= \alpha, \\
b_{21}\xi_1 + (b_{22} - \lambda)\xi_2 + \cdots + b_{2n}\xi_n &= 0, \\
&\ \,\vdots \\
b_{n,n-1}\xi_{n-1} + (b_{nn} - \lambda)\xi_n &= 0,
\end{aligned}$$

and $\xi_n = 1$. This is an $n \times n$ upper triangular linear system for the $n$
unknowns $\alpha, \xi_1, \ldots, \xi_{n-1}$, and it can be solved by backward substitution.
Setting





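$$C = \begin{pmatrix}
b_{11} - \lambda & b_{12} & \cdots & b_{1,n-1} & \alpha \\
b_{21} & b_{22} - \lambda & \cdots & b_{2,n-1} & 0 \\
 & \ddots & \ddots & \vdots & \vdots \\
 & & & b_{n,n-1} & 0
\end{pmatrix},$$

by Cramer's rule we have that

$$1 = \xi_n = \frac{\det C}{\det(B - \lambda I)};$$

that is,

$$p(\lambda) = (-1)^{n-1}\, b_{21} \cdots b_{n,n-1}\, \alpha(\lambda).$$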

Differentiating the last equation yields

$$p'(\lambda) = (-1)^{n-1}\, b_{21} \cdots b_{n,n-1}\, \alpha'(\lambda),$$

and therefore

$$\frac{p(\lambda)}{p'(\lambda)} = \frac{\alpha(\lambda)}{\alpha'(\lambda)}.$$

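By differentiating the above linear system with respect to $\lambda$ we obtain
the linear system

$$\begin{aligned}
(b_{11} - \lambda)\eta_1 + b_{12}\eta_2 + \cdots + b_{1,n-1}\eta_{n-1} &= \xi_1 + \beta, \\
b_{21}\eta_1 + (b_{22} - \lambda)\eta_2 + \cdots + b_{2,n-1}\eta_{n-1} &= \xi_2, \\
&\ \,\vdots \\
b_{n,n-1}\eta_{n-1} &= \xi_n
\end{aligned}$$

for the derivatives $\beta = \alpha'$, $\eta_1 = \xi_1', \ldots, \eta_{n-1} = \xi_{n-1}'$. This linear system
again can be solved by backward substitution for the $n$ unknowns
$\beta, \eta_1, \ldots, \eta_{n-1}$. Thus we have proven the following theorem.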

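Theorem 7.25 Let $B = (b_{jk})$ be an irreducible Hessenberg matrix and let
$\lambda \in \mathbb{C}$. Starting from $\xi_n = 1$, $\eta_n = 0$, compute recursively

$$\xi_{n-k} = \frac{1}{b_{n-k+1,n-k}} \Bigl\{ \lambda \xi_{n-k+1} - \sum_{j=n-k+1}^{n} b_{n-k+1,j}\, \xi_j \Bigr\},$$

$$\eta_{n-k} = \frac{1}{b_{n-k+1,n-k}} \Bigl\{ \xi_{n-k+1} + \lambda \eta_{n-k+1} - \sum_{j=n-k+1}^{n} b_{n-k+1,j}\, \eta_j \Bigr\}$$

for $k = 1, \ldots, n-1$ and

$$\alpha = -\lambda \xi_1 + \sum_{j=1}^{n} b_{1j}\, \xi_j, \qquad
\beta = -\xi_1 - \lambda \eta_1 + \sum_{j=1}^{n} b_{1j}\, \eta_j.$$

Then for the characteristic polynomial of $B$ we have

$$\frac{p(\lambda)}{p'(\lambda)} = \frac{\alpha}{\beta}.$$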
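
As an illustration of Theorem 7.25, the two backward substitutions can be carried out together in one loop. The following is a minimal sketch under stated assumptions: the function name, the list-of-rows storage of $B$, and the small test matrix are choices made here and are not taken from the text.

```python
def hessenberg_alpha_beta(B, lam):
    """Return (alpha, beta) of Theorem 7.25 for an irreducible upper
    Hessenberg matrix B (list of rows), so that p(lam)/p'(lam) = alpha/beta."""
    n = len(B)
    xi = [0.0] * n
    eta = [0.0] * n
    xi[n - 1] = 1.0                        # xi_n = 1, eta_n = 0
    for i in range(n - 1, 0, -1):          # row i (0-based) yields xi[i-1], eta[i-1]
        s_xi = sum(B[i][j] * xi[j] for j in range(i, n))
        s_eta = sum(B[i][j] * eta[j] for j in range(i, n))
        xi[i - 1] = (lam * xi[i] - s_xi) / B[i][i - 1]
        eta[i - 1] = (xi[i] + lam * eta[i] - s_eta) / B[i][i - 1]
    alpha = -lam * xi[0] + sum(B[0][j] * xi[j] for j in range(n))
    beta = -xi[0] - lam * eta[0] + sum(B[0][j] * eta[j] for j in range(n))
    return alpha, beta

# Hypothetical test matrix: the 3 x 3 case of Example 7.24.
B = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 2.0]]
alpha, beta = hessenberg_alpha_beta(B, 0.0)
```

For this matrix $p(\lambda) = (2-\lambda)^3 - 2(2-\lambda)$, so $p(0) = 4$ and $p'(0) = -10$, which agrees with the computed $\alpha$ and $\beta$ because $b_{21} b_{32} = 1$.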
Problems

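7.1 For the eigenvalues $\lambda_1, \ldots, \lambda_n$ (repeated according to their algebraic
multiplicity) of an $n \times n$ matrix $A$ show that

$$\operatorname{tr} A = \sum_{j=1}^{n} \lambda_j \quad \text{and} \quad \det A = \prod_{j=1}^{n} \lambda_j.$$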

7.2 For Example 7.1 show that in the case $p = 0$ the eigenvalues of the matrix
$A$ converge to the eigenvalues of the differential operator $D$ as $n \to \infty$.

7.3 Show that the eigenvalues of the adjoint matrix $A^*$ are the complex
conjugates of the eigenvalues of the matrix $A$.

7.4 Show that for the eigenvalues of Hermitian matrices the geometric and the
algebraic multiplicities coincide.

7.5 Use Gerschgorin’s Theorem 7.7 to determine the approximate location of 
the eigenvalues of the matrix 




To check the estimates, compute the eigenvalues by finding the zeros of the char- 
acteristic polynomial. 

7.6 Let $A$ be a diagonalizable $n \times n$ matrix with eigenvalues $\lambda_1, \ldots, \lambda_n$, $B$ an
$n \times n$ matrix, and $\lambda$ an eigenvalue of $A + B$. Show that

$$\min_{j=1,\ldots,n} |\lambda - \lambda_j| \le \|C\|_p\, \|C^{-1}\|_p\, \|B\|_p,$$

where $C$ is a nonsingular matrix such that $C^{-1}AC$ is diagonal and $p = 1, 2, \infty$.

7.7 Show that the Frobenius norm is indeed a norm on the linear space of 
matrices. 

7.8 Write a computer program for the Jacobi method and test it for various 
examples. 

7.9 Assume that $A$ is a real symmetric $n \times n$ matrix with eigenvalue $\lambda$ of
multiplicity $n - 1$ and a further eigenvalue $\mu \neq \lambda$. Show that $A = \lambda I + (\mu - \lambda)xx^*$,
where $x^*x = 1$, and that by at most $n - 1$ Jacobi transformations $A$ becomes
diagonal.

7.10 Show convergence of the cyclic Jacobi method with threshold $[N(A)]^2/(2n^2)$.

7.11 Let $A$ be a diagonalizable $n \times n$ matrix with eigenvalues $\lambda_1, \ldots, \lambda_n$ and
eigenvectors $x_1, \ldots, x_n$, and assume that $|\lambda_1| > |\lambda_2| \ge |\lambda_3| \ge \cdots \ge |\lambda_n|$. Starting
from $v_0 \in \mathbb{C}^n$ with $v_0 \notin \operatorname{span}\{x_2, \ldots, x_n\}$ show that the sequence

$$v_{\nu+1} := \frac{A v_\nu}{\|A v_\nu\|_2}, \quad \nu = 0, 1, 2, \ldots,$$

is well-defined and that the sequence of Rayleigh quotients

$$R_\nu := \frac{(A v_\nu, v_\nu)}{\|v_\nu\|_2^2}, \quad \nu = 0, 1, 2, \ldots,$$

satisfies the estimate

$$|R_\nu - \lambda_1| \le C r^\nu$$

for some constant $C > 0$ and $r := |\lambda_2 / \lambda_1|$.

7.12 The matrix

$$A = \begin{pmatrix} 1 & 2 & 1 \\ 1.5 & 1 & 1.5 \\ 2 & 0.5 & 0.5 \end{pmatrix}$$

has eigenvalue $\lambda = 4$ with eigenvector $x = (1, 1, 1)^T$. Construct a Householder
matrix $H$ such that

$$HAH = \begin{pmatrix} 4 & * & * \\ 0 & * & * \\ 0 & * & * \end{pmatrix}$$

and determine the remaining eigenvalues.

7.13 Write a computer program for the QR algorithm and test it for various 
examples. 

7.14 Verify the numerical results of Table 5.2 for the Hilbert matrix. 

7.15 Show that Hessenberg matrices $B = (b_{jk})$ with $b_{j,j-1} \neq 0$ for $j = 2, \ldots, n$
are irreducible.






7.16 Show that the Hessenberg form of a matrix is preserved by the QR algo- 
rithm. 



7.17 Show that the number of multiplications required for the transformation
of a matrix into Hessenberg form via Householder transformations according to
Theorem 7.22 is $5n^3/3 + O(n^2)$.

7.18 Write a computer program for transforming a matrix into Hessenberg form 
via Householder transformations according to Theorem 7.22. 

7.19 Discuss Newton's method for the solution of $Ax = \lambda x$, $x^T x = 1$ in the
neighborhood of a simple eigenvalue of a real symmetric matrix $A$.

7.20 Prove the inequality 

n / I \ 1/2 

3 = 1 

for the eigenvalues of an $n \times n$ matrix $A$ (see Corollary 7.9 and [41]).



Interpolation



Polynomials have attracted the attention of mathematicians for centuries 
because of their many beautiful properties. For numerical purposes they 
have the advantage that their computation reduces to additions and mul- 
tiplications only. Therefore, it is quite natural to use polynomials for the 
approximation of more complicated functions. A classical approach to spec- 
ifying the coefficients of a polynomial of degree n is to prescribe that its 
values at n + 1 distinct points coincide with those of the function to be 
approximated. The development and investigation of such interpolation 
polynomials has a long mathematical history, beginning with the use of 
the method of interpolation to tabulate the logarithms, as proposed by 
Briggs in the early seventeenth century. 

It is the purpose of the first section, Section 8.1, of this chapter to intro- 
duce the classical theory of polynomial interpolation, including discussions 
on the effective numerical computation of interpolation polynomials and an 
analysis of the resulting approximation error. The next section, Section 8.2, 
describes the corresponding theory for the interpolation of periodic func- 
tions by trigonometric polynomials. For a detailed study of the foundations 
of classical interpolation theory we refer to [16]. 

In the last two sections, Sections 8.3 and 8.4, we proceed with a study 
of interpolation by splines, i.e., piecewise polynomial interpolation, which 
was developed within the last fifty years and has turned into a successful 
tool in approximation theory and other parts of numerical analysis. For a 
comprehensive study of spline functions we refer to [18, 53], and for their 
use in computer-aided geometric design we refer to [23].

We would like to point out that interpolation is not only important as
a tool for the approximation of functions that are difficult to compute or 
whose values are known only at discrete points. It also serves as an essential 
ingredient for developing numerical integration rules and methods for the 
approximate solution of differential and integral equations, as we shall see 
in the following chapters. 



8.1 Polynomial Interpolation 

For $n \in \mathbb{N} \cup \{0\}$, we denote by $P_n$ the linear space of polynomials

$$p(x) = \sum_{k=0}^{n} a_k x^k$$

for a real (or complex) variable $x$ and with real (or complex) coefficients
$a_0, \ldots, a_n$. A polynomial $p \in P_n$ is said to be of degree $n$ if $a_n \neq 0$. In
this chapter, we consider $P_n$ as a subspace of the linear space $C[a, b]$ of
continuous real- (or complex-) valued functions on the interval $[a, b]$, where
$a < b$. For $m \in \mathbb{N}$ we denote by $C^m[a, b]$ the linear space of $m$ times
continuously differentiable real- (or complex-) valued functions on $[a, b]$.

We recall the following basic uniqueness property of algebraic polynomi- 
als as part of the fundamental theorem of algebra. Since we will use this 
property frequently, it is appropriate to include a simple proof by induction. 

Theorem 8.1 For $n \in \mathbb{N} \cup \{0\}$, each polynomial in $P_n$ that has more than
$n$ (complex) zeros, where each zero is counted repeatedly according to its
multiplicity, must vanish identically; i.e., all its coefficients must be equal
to zero.

Proof. Obviously, the statement is true for $n = 0$. Assume that it has been
proven for some $n \ge 0$. By using the binomial formula for $x^k = [(x - z) + z]^k$
we can rewrite the polynomial $p \in P_{n+1}$ in the form

$$p(x) = \sum_{k=1}^{n+1} b_k (x - z)^k + b_0$$

with the coefficients $b_0, b_1, \ldots, b_{n+1}$ depending on $a_0, a_1, \ldots, a_{n+1}$ and $z$.
If $z$ is a zero of $p$, then we must have $b_0 = 0$, and this implies that
$p(x) = (x - z)q(x)$ with $q \in P_n$. Obviously, $q$ has more than $n$ zeros,
since $p$ has more than $n + 1$ zeros. Hence, by the induction assumption, $q$
must vanish identically, and this implies that $p$ vanishes identically. □

Theorem 8.2 The monomials $u_k(x) := x^k$, $k = 0, \ldots, n$, are linearly
independent.

Proof. In order to prove this, assume that

$$\sum_{k=0}^{n} a_k u_k = 0,$$

that is,

$$\sum_{k=0}^{n} a_k x^k = 0, \quad x \in [a, b].$$

Then the polynomial with coefficients $a_0, a_1, \ldots, a_n$ has more than $n$
distinct zeros, and from Theorem 8.1 it follows that all the coefficients must
be zero. □

The linear independence of the monomials $u_0, \ldots, u_n$ implies that they
form a basis for $P_n$ and that $P_n$ has dimension $n + 1$.

Theorem 8.3 Given $n + 1$ distinct points $x_0, \ldots, x_n \in [a, b]$ and $n + 1$
values $y_0, \ldots, y_n \in \mathbb{R}$, there exists a unique polynomial $p_n \in P_n$ with the
property

$$p_n(x_j) = y_j, \quad j = 0, \ldots, n. \tag{8.1}$$

In the Lagrange representation, this interpolation polynomial is given by

$$p_n = \sum_{k=0}^{n} y_k \ell_k \tag{8.2}$$

with the Lagrange factors

$$\ell_k(x) = \prod_{\substack{i=0 \\ i \neq k}}^{n} \frac{x - x_i}{x_k - x_i}, \quad k = 0, \ldots, n.$$

Proof. We note that $\ell_k \in P_n$ for $k = 0, \ldots, n$ and that the equations

$$\ell_k(x_j) = \delta_{jk}, \quad j, k = 0, \ldots, n, \tag{8.3}$$

hold, where $\delta_{jk} = 1$ for $k = j$, and $\delta_{jk} = 0$ for $k \neq j$. It follows that $p_n$
given by (8.2) is in $P_n$, and it fulfills the required interpolation conditions
$p_n(x_j) = y_j$, $j = 0, \ldots, n$.

To prove uniqueness of the interpolation polynomial we assume that
$p_{n,1}, p_{n,2} \in P_n$ are two polynomials satisfying (8.1). Then the difference
$p_n := p_{n,1} - p_{n,2}$ satisfies $p_n(x_j) = 0$, $j = 0, \ldots, n$; i.e., the polynomial
$p_n \in P_n$ has $n + 1$ zeros and therefore by Theorem 8.1 must be identically
zero. This implies that $p_{n,1} = p_{n,2}$. □
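
The Lagrange representation (8.2) translates directly into code. The sketch below is illustrative only: the function names are hypothetical, exact rational arithmetic is used to avoid rounding, and the sample data are the four points that also appear in Example 8.5.

```python
from fractions import Fraction

def lagrange_factor(xs, k, x):
    """Value at x of the Lagrange factor l_k belonging to the nodes xs."""
    value = Fraction(1)
    for i, xi in enumerate(xs):
        if i != k:
            value *= (x - xi) / (xs[k] - xi)
    return value

def lagrange_interpolate(xs, ys, x):
    """Evaluate the interpolation polynomial in the form (8.2) at x."""
    return sum(y * lagrange_factor(xs, k, x) for k, y in enumerate(ys))

xs = [Fraction(v) for v in (0, 1, 3, 4)]
ys = [Fraction(v) for v in (0, 2, 8, 9)]
value_at_2 = lagrange_interpolate(xs, ys, Fraction(2))   # equals 31/6
```

Each evaluation costs $O(n^2)$ operations in this form, which is one practical reason for preferring other representations when many points are evaluated.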



The representation (8.2), which was discovered by Lagrange in 1794, is
very convenient for theoretical investigations because of its simple structure.
However, for practical computations it is suitable only for small $n$. For
$n$ large the Lagrange factors become very large and highly oscillatory, which
causes ill-conditioning of the Lagrange interpolation polynomial. Already
in 1676, in his study of quadrature formulae (see Theorem 9.3), Newton
had obtained a representation of the interpolation polynomial that is more
practical for computational purposes. For its description we need to give
the following definition.

Definition 8.4 Given $n + 1$ distinct points $x_0, \ldots, x_n \in [a, b]$ and $n + 1$
values $y_0, \ldots, y_n \in \mathbb{R}$, the divided differences $D_j^k$ of order $k$ at the point
$x_j$ are recursively defined by

$$D_j^0 := y_j, \quad j = 0, \ldots, n,$$

$$D_j^k := \frac{D_{j+1}^{k-1} - D_j^{k-1}}{x_{j+k} - x_j}, \quad j = 0, \ldots, n - k, \quad k = 1, \ldots, n.$$

We notice that the points $x_0, \ldots, x_n$ need not be in ascending order. It
is convenient to arrange the divided differences according to the tableau

$$\begin{array}{ccccc}
x_0 & y_0 = D_0^0 & & & \\
 & & D_0^1 & & \\
x_1 & y_1 = D_1^0 & & D_0^2 & \\
 & & D_1^1 & & D_0^3 \\
x_2 & y_2 = D_2^0 & & D_1^2 & \\
 & & D_2^1 & & \\
x_3 & y_3 = D_3^0 & & &
\end{array}$$

which we illustrate by the following example. Obviously, for the full tableau
the computational cost is of order $O(n^2)$.

Example 8.5 For the points $x_0 = 0$, $x_1 = 1$, $x_2 = 3$, $x_3 = 4$ and the
values $y_0 = 0$, $y_1 = 2$, $y_2 = 8$, $y_3 = 9$ the tableau of the divided differences
is given by

$$\begin{array}{ccccc}
0 & 0 & & & \\
 & & 2 & & \\
1 & 2 & & 1/3 & \\
 & & 3 & & -1/4 \\
3 & 8 & & -2/3 & \\
 & & 1 & & \\
4 & 9 & & &
\end{array}$$

Each value $D_j^k$ in the $k$th column is obtained by taking the difference of
the two neighboring values $D_{j+1}^{k-1}$ and $D_j^{k-1}$ in the preceding column and
dividing it by the difference $x_{j+k} - x_j$ of the points $x_{j+k}$ and $x_j$. □
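
The tableau of Example 8.5 can be reproduced with a short routine that follows Definition 8.4 column by column. This is a minimal sketch; the function name and the nested-list layout `table[k][j]` for $D_j^k$ are assumptions made for illustration.

```python
from fractions import Fraction

def divided_differences(xs, ys):
    """Return the tableau as a list of columns: table[k][j] holds D_j^k."""
    n = len(xs) - 1
    table = [list(ys)]                       # column k = 0: D_j^0 = y_j
    for k in range(1, n + 1):
        prev = table[k - 1]
        table.append([(prev[j + 1] - prev[j]) / (xs[j + k] - xs[j])
                      for j in range(n - k + 1)])
    return table

xs = [Fraction(v) for v in (0, 1, 3, 4)]
ys = [Fraction(v) for v in (0, 2, 8, 9)]
table = divided_differences(xs, ys)
```

With exact fractions the columns come out as 2, 3, 1, then 1/3, -2/3, and finally -1/4, matching the tableau above.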
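Lemma 8.6 The divided differences satisfy the relation

$$D_j^k = \sum_{m=j}^{j+k} y_m \prod_{\substack{i=j \\ i \neq m}}^{j+k} \frac{1}{x_m - x_i}, \quad j = 0, \ldots, n - k, \quad k = 1, \ldots, n. \tag{8.4}$$

Proof. We proceed by induction with respect to the order $k$. Trivially, (8.4)
holds for $k = 1$. We assume that (8.4) has been proven for order $k - 1$ for
some $k \ge 2$. Then, using Definition 8.4, the induction assumption, and the
identity

$$\frac{1}{x_{j+k} - x_j} \left\{ \frac{1}{x_m - x_{j+k}} - \frac{1}{x_m - x_j} \right\} = \frac{1}{(x_m - x_{j+k})(x_m - x_j)},$$

we obtain

$$\begin{aligned}
D_j^k &= \frac{1}{x_{j+k} - x_j} \left\{ \sum_{m=j+1}^{j+k} y_m \prod_{\substack{i=j+1 \\ i \neq m}}^{j+k} \frac{1}{x_m - x_i} - \sum_{m=j}^{j+k-1} y_m \prod_{\substack{i=j \\ i \neq m}}^{j+k-1} \frac{1}{x_m - x_i} \right\} \\
&= \frac{1}{x_{j+k} - x_j} \sum_{m=j+1}^{j+k-1} y_m \left\{ \frac{1}{x_m - x_{j+k}} - \frac{1}{x_m - x_j} \right\} \prod_{\substack{i=j+1 \\ i \neq m}}^{j+k-1} \frac{1}{x_m - x_i} \\
&\qquad + y_{j+k} \prod_{i=j}^{j+k-1} \frac{1}{x_{j+k} - x_i} + y_j \prod_{i=j+1}^{j+k} \frac{1}{x_j - x_i} \\
&= \sum_{m=j}^{j+k} y_m \prod_{\substack{i=j \\ i \neq m}}^{j+k} \frac{1}{x_m - x_i};
\end{aligned}$$

i.e., (8.4) also holds for order $k$. □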

Theorem 8.7 In the Newton representation, for $n \ge 1$ the uniquely
determined interpolation polynomial $p_n$ of Theorem 8.3 is given by

$$p_n(x) = y_0 + \sum_{k=1}^{n} D_0^k \prod_{i=0}^{k-1} (x - x_i). \tag{8.5}$$

Proof. We denote the right-hand side of (8.5) by $\tilde{p}_n$ and establish $p_n = \tilde{p}_n$
by induction with respect to the degree $n$. For $n = 1$ the representation
(8.5) is correct. We assume that (8.5) has been proven for degree $n - 1$ for
some $n \ge 2$ and consider the difference $d_n := p_n - \tilde{p}_n$. Since

$$d_n(x) = p_n(x) - p_{n-1}(x) - D_0^n \prod_{i=0}^{n-1} (x - x_i),$$

as a consequence of Theorem 8.3 and Lemma 8.6 the coefficient of $x^n$ in the
polynomial $d_n$ vanishes; i.e., $d_n \in P_{n-1}$. Using the induction assumption,
we have that

$$p_{n-1}(x_j) = y_j = p_n(x_j), \quad j = 0, \ldots, n - 1,$$

and therefore

$$d_n(x_j) = 0, \quad j = 0, \ldots, n - 1.$$

Hence, by Theorem 8.1 it follows that $d_n = 0$, and therefore $p_n = \tilde{p}_n$. □

Example 8.8 The interpolation polynomial corresponding to Example 8.5
is given by

$$p_3(x) = 2x + \frac{1}{3}\, x(x - 1) - \frac{1}{4}\, x(x - 1)(x - 3).$$

□



Analogously to the Horner scheme (see (6.10)), the value of the Newton
interpolation polynomial at a point $x$ can be obtained by nested
multiplications according to

$$\begin{aligned}
p_n(x) &= a_n (x - x_0)(x - x_1) \cdots (x - x_{n-1}) + \cdots + a_1 (x - x_0) + a_0 \\
&= (\cdots (a_n (x - x_{n-1}) + a_{n-1})(x - x_{n-2}) + \cdots + a_1)(x - x_0) + a_0
\end{aligned}$$

by $O(n)$ multiplications and additions. For an evaluation of the interpolation
polynomial at a single point $x$ without explicitly computing the coefficients
of the polynomial, the following Neville scheme is very practical.
From the formal coincidence of the recursion (8.6) and Definition 8.4 for
the divided differences, it is obvious that the computations for (8.6) can be
arranged in a tableau analogous to the tableau for the divided differences.
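
Returning to the nested multiplication for the Newton form, the following sketch first computes the coefficients $a_k = D_0^k$ in place and then evaluates (8.5) with $O(n)$ operations per point. The function names and the in-place storage scheme are illustrative assumptions; the data are those of Examples 8.5 and 8.8.

```python
from fractions import Fraction

def newton_coefficients(xs, ys):
    """Return [D_0^0, D_0^1, ..., D_0^n], overwriting a copy of ys in place."""
    coeffs = list(ys)
    n = len(xs)
    for k in range(1, n):
        # Go from the bottom up so coeffs[j - 1] still holds order k - 1.
        for j in range(n - 1, k - 1, -1):
            coeffs[j] = (coeffs[j] - coeffs[j - 1]) / (xs[j] - xs[j - k])
    return coeffs

def newton_evaluate(xs, coeffs, x):
    """Evaluate the Newton form (8.5) at x by nested multiplication."""
    value = coeffs[-1]
    for k in range(len(coeffs) - 2, -1, -1):
        value = value * (x - xs[k]) + coeffs[k]
    return value

xs = [Fraction(v) for v in (0, 1, 3, 4)]
ys = [Fraction(v) for v in (0, 2, 8, 9)]
coeffs = newton_coefficients(xs, ys)        # 0, 2, 1/3, -1/4 as in Example 8.8
value_at_2 = newton_evaluate(xs, coeffs, Fraction(2))
```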

Theorem 8.9 Given $n + 1$ distinct points $x_0, \ldots, x_n \in [a, b]$ and $n + 1$
values $y_0, \ldots, y_n \in \mathbb{R}$, the uniquely determined interpolation polynomials
$p_i^k \in P_k$, $i = 0, \ldots, n - k$, $k = 0, \ldots, n$, with the interpolation property

$$p_i^k(x_j) = y_j, \quad j = i, \ldots, i + k,$$

satisfy the recursive relation

$$p_i^0(x) = y_i, \qquad
p_i^k(x) = \frac{(x - x_i)\, p_{i+1}^{k-1}(x) - (x - x_{i+k})\, p_i^{k-1}(x)}{x_{i+k} - x_i}, \quad k = 1, \ldots, n. \tag{8.6}$$



Proof. We again proceed by induction with respect to the degree $k$. Obviously,
the statement is true for $k = 1$. Assume that the assertion has been
proven for degree $k - 1$ for some $k \ge 2$. Then the right-hand side of (8.6)
describes a polynomial $p \in P_k$, and by the induction assumption we find
that the interpolation conditions

$$p(x_j) = \frac{(x_j - x_i)\, y_j - (x_j - x_{i+k})\, y_j}{x_{i+k} - x_i} = y_j, \quad j = i + 1, \ldots, i + k - 1,$$

as well as $p(x_i) = y_i$ and $p(x_{i+k}) = y_{i+k}$ are fulfilled. □
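
The recursion (8.6) can be coded with a single working array, since each column of the Neville tableau depends only on the previous one. This is a sketch under the assumption that one value $p_0^n(x)$ is wanted; the function name is hypothetical and the data are again those of Example 8.5.

```python
from fractions import Fraction

def neville(xs, ys, x):
    """Evaluate the interpolation polynomial at x by the recursion (8.6)."""
    p = list(ys)                           # p[i] holds p_i^0(x) = y_i
    n = len(xs) - 1
    for k in range(1, n + 1):
        for i in range(n - k + 1):         # overwrite p[i] with p_i^k(x)
            p[i] = ((x - xs[i]) * p[i + 1]
                    - (x - xs[i + k]) * p[i]) / (xs[i + k] - xs[i])
    return p[0]

xs = [Fraction(v) for v in (0, 1, 3, 4)]
ys = [Fraction(v) for v in (0, 2, 8, 9)]
value_at_2 = neville(xs, ys, Fraction(2))
```

The result agrees with the value $p_3(2) = 31/6$ obtained from the Newton form of Example 8.8.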



The main application of polynomial interpolation consists in the
approximation of continuous functions $f : [a, b] \to \mathbb{R}$. In this case, given $n + 1$
distinct points $x_0, \ldots, x_n \in [a, b]$, by

$$L_n : C[a, b] \to P_n$$

we denote the interpolation operator that maps the function $f \in C[a, b]$
onto its uniquely determined interpolation polynomial $L_n f \in P_n$ with the
property

$$(L_n f)(x_j) = f(x_j), \quad j = 0, \ldots, n. \tag{8.7}$$

From the Lagrange representation (8.2) it can be seen that the operator
$L_n$ is linear and bounded (see Problem 8.4). Moreover, since $L_n p = p$ for
all $p \in P_n$, the interpolation operator is a projection; i.e., $L_n^2 = L_n$.

The interpolation polynomial $L_n f$ is used as an approximation for the
function $f$, since in general, the polynomial $L_n f$ is better suited for
computational purposes than the original function $f$. In the sequel we shall be
concerned with estimating the approximation error $f - L_n f$.

Theorem 8.10 Let $f : [a, b] \to \mathbb{R}$ be $(n + 1)$-times continuously
differentiable. Then the remainder $R_n f := f - L_n f$ for polynomial interpolation
with $n + 1$ distinct points $x_0, \ldots, x_n \in [a, b]$ can be represented in the form

$$(R_n f)(x) = \frac{f^{(n+1)}(\xi)}{(n + 1)!} \prod_{j=0}^{n} (x - x_j), \quad x \in [a, b], \tag{8.8}$$

for some $\xi \in [a, b]$ depending on $x$.

Proof. Since (8.8) is trivially satisfied if $x$ coincides with one of the
interpolation points $x_0, \ldots, x_n$, we need be concerned only with the case where
$x$ does not coincide with one of the interpolation points. We define

$$q_{n+1}(x) := \prod_{j=0}^{n} (x - x_j)$$

and, keeping $x$ fixed, consider $g : [a, b] \to \mathbb{R}$ given by

$$g(y) := f(y) - (L_n f)(y) - q_{n+1}(y)\, \frac{f(x) - (L_n f)(x)}{q_{n+1}(x)}, \quad y \in [a, b].$$

By the assumption on $f$, the function $g$ is also $(n + 1)$-times continuously
differentiable. Obviously, $g$ has at least $n + 2$ zeros, namely $x$ and $x_0, \ldots, x_n$.
Then, by Rolle's theorem the derivative $g'$ has at least $n + 1$ zeros. Repeating
the argument, by induction we deduce that the derivative $g^{(n+1)}$ has at least
one zero in $[a, b]$, which we denote by $\xi$. For this zero we have that

$$0 = f^{(n+1)}(\xi) - (n + 1)!\, \frac{(R_n f)(x)}{q_{n+1}(x)},$$

and from this we obtain (8.8). □



The intermediate point £ in the error representation (8.8) is not, in gen- 
eral, known explicitly. Therefore, the interpolation error is estimated by 
the following corollary. 



Corollary 8.11 Under the assumptions of Theorem 8.10 we have the error
estimate

$$\|R_n f\|_\infty \le \frac{\|q_{n+1}\|_\infty}{(n + 1)!}\, \|f^{(n+1)}\|_\infty.$$

Example 8.12 The linear interpolation is given by

$$(L_1 f)(x) = \frac{1}{h}\, [f(x_0)(x_1 - x) + f(x_1)(x - x_0)]$$

with the step width $h = x_1 - x_0$. For the polynomial $q_2(x) = (x - x_0)(x - x_1)$
we have that

$$\max_{x \in [x_0, x_1]} |q_2(x)| = \frac{h^2}{4}.$$

Therefore, by Corollary 8.11, the error occurring in linear interpolation of
a twice continuously differentiable function $f$ can be estimated by

$$|(R_1 f)(x)| \le \frac{h^2}{8} \max_{y \in [x_0, x_1]} |f''(y)|, \quad x \in [x_0, x_1]. \tag{8.9}$$

For example, the error in linear interpolation with step size $h = 0.01$ for
the sine function is less than or equal to $h^2/8 = 0.0000125$. □
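
The bound (8.9) is easy to check numerically. The sketch below samples the error of linear interpolation of the sine function on one subinterval; the left endpoint $x_0 = 1$ and the number of sample points are arbitrary choices made for this illustration.

```python
import math

def linear_interpolant(f, x0, x1, x):
    """Value at x of the linear interpolation polynomial L_1 f."""
    h = x1 - x0
    return (f(x0) * (x1 - x) + f(x1) * (x - x0)) / h

h = 0.01
x0 = 1.0                                    # arbitrary left endpoint
x1 = x0 + h
samples = [x0 + h * i / 1000 for i in range(1001)]
max_error = max(abs(math.sin(x) - linear_interpolant(math.sin, x0, x1, x))
                for x in samples)
bound = h ** 2 / 8                          # right-hand side of (8.9), |f''| <= 1
```

On this subinterval the observed maximum stays below the bound and is of the same order of magnitude, since $|\sin y|$ is roughly 0.84 there.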

By the following examples we want to introduce the question of whether
the interpolation polynomials converge when the number $n + 1$ of
interpolation points, and hence the degree $n$ of the interpolation polynomials,
tends to infinity.

Example 8.13 Let $f(x) := \sin x$ and let $x_0, \ldots, x_n \in [0, \pi]$ be $n + 1$
distinct points. Since

$$|f^{(n+1)}(x)| \le 1, \quad x \in [0, \pi],$$

and

$$|q_{n+1}(x)| \le \pi^{n+1}, \quad x \in [0, \pi],$$

by Corollary 8.11, we have the estimate

$$|(R_n f)(x)| \le \frac{\pi^{n+1}}{(n + 1)!}, \quad x \in [0, \pi].$$

Hence the sequence $(L_n f)$ of interpolation polynomials converges to the
interpolated function $f$ uniformly on $[0, \pi]$ as $n \to \infty$. □

Example 8.14 A first detailed example of the insufficiency of polynomial
interpolation even for analytic functions was investigated by Runge in 1901.
He considered the simple function

$$f(x) = \frac{1}{1 + 25x^2}$$

on the interval $[-1, 1]$ with equidistant interpolation points. He discovered
that as the degree $n$ tends to infinity, the interpolation polynomials diverge
for $0.726 < |x| < 1$, whereas the approximation works satisfactorily in the
central portion of the interval (see Problem 8.6). Although $f$ is analytic
in all of $\mathbb{R}$, its poles in the complex plane at $\pm i/5$ are responsible for this
divergence. □
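
Runge's observation can be reproduced with a few lines of code. The sketch below measures the maximum interpolation error on a fine grid for equidistant points; the helper names, the choice of degrees 5, 10, and 20, and the grid of 2001 sample points are assumptions made for illustration, and the evaluation uses the Neville recursion (8.6).

```python
def runge(x):
    return 1.0 / (1.0 + 25.0 * x * x)

def interpolate_at(xs, ys, x):
    """Evaluate the interpolation polynomial at x via the recursion (8.6)."""
    p = list(ys)
    n = len(xs) - 1
    for k in range(1, n + 1):
        for i in range(n - k + 1):
            p[i] = ((x - xs[i]) * p[i + 1]
                    - (x - xs[i + k]) * p[i]) / (xs[i + k] - xs[i])
    return p[0]

def max_interpolation_error(n, samples=2001):
    """Largest sampled error on [-1, 1] for n + 1 equidistant points."""
    xs = [-1.0 + 2.0 * j / n for j in range(n + 1)]
    ys = [runge(x) for x in xs]
    grid = [-1.0 + 2.0 * i / (samples - 1) for i in range(samples)]
    return max(abs(runge(x) - interpolate_at(xs, ys, x)) for x in grid)

errors = {n: max_interpolation_error(n) for n in (5, 10, 20)}
```

The sampled maximum error grows as the degree increases, with the largest deviations located near the ends of the interval.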

Example 8.15 Consider the continuous function

$$f(x) := \begin{cases} x \sin \dfrac{\pi}{x}, & x \in (0, 1], \\ 0, & x = 0. \end{cases}$$

With the interpolation points chosen as

$$x_j = \frac{1}{j + 1}, \quad j = 0, \ldots, n,$$

we have that $f(x_j) = 0$, $j = 0, \ldots, n$, and therefore $L_n f = 0$ for all
$n$. Hence, in this case the sequence $(L_n f)$ converges only at the points
$x_j$, $j \in \mathbb{N} \cup \{0\}$, to the interpolated function $f$. □

These three examples illustrate that for polynomial interpolation both 
convergence and divergence are possible. We complement the examples by 
stating the following two theorems without detailed proofs. 

Theorem 8.16 (Marcinkiewicz) For each function $f \in C[a, b]$ there
exists a sequence of interpolation points $(x_j^{(n)})$, $j = 0, \ldots, n$, $n = 0, 1, \ldots$,
such that the sequence $(L_n f)$ of interpolation polynomials $L_n f \in P_n$ with
$(L_n f)(x_j^{(n)}) = f(x_j^{(n)})$, $j = 0, \ldots, n$, converges to $f$ uniformly on $[a, b]$.

Proof. The proof relies on the Weierstrass approximation theorem and the
Chebyshev alternation theorem. The Weierstrass approximation theorem
(see [16]) ensures that for each $f \in C[a, b]$ there exists a sequence of
polynomials $p_n \in P_n$ such that $\|p_n - f\|_\infty \to 0$ as $n \to \infty$. As a consequence of
the Chebyshev alternation theorem from approximation theory (see [16]),
for the uniquely determined best approximation $p_n$ to $f$ in the maximum
norm with respect to $P_n$, the error $p_n - f$ has at least $n + 1$ zeros in $[a, b]$.
Then taking the sequence of these zeros as the sequence of interpolation
points implies the statement of the theorem. □

Theorem 8.17 (Faber) For each sequence of interpolation points $(x_j^{(n)})$
there exists a function $f \in C[a, b]$ such that the sequence $(L_n f)$ of
interpolation polynomials $L_n f \in P_n$ does not converge to $f$ uniformly on $[a, b]$.

Proof. This is a consequence of the uniform boundedness principle,
Theorem 12.7. It implies that from the convergence of the sequence $(L_n f)$ for
all $f \in C[a, b]$ it follows that there must exist a constant $C > 0$ such that
$\|L_n\|_\infty \le C$ for all $n \in \mathbb{N}$. Then the statement of the theorem is obtained
by showing that the interpolation operator $L_n$ satisfies $\|L_n\|_\infty \ge c \ln n$ for
all $n \in \mathbb{N}$ and some $c > 0$ (see [16]). □

We conclude this section by briefly describing Hermite interpolation,
where in addition to the values of the polynomial, the values of its first
derivative at the interpolation points are also prescribed.

Theorem 8.18 Given $n + 1$ distinct points $x_0, \ldots, x_n \in [a, b]$ and $2n + 2$
values $y_0, \ldots, y_n \in \mathbb{R}$ and $y_0', \ldots, y_n' \in \mathbb{R}$, there exists a unique polynomial
$p_{2n+1} \in P_{2n+1}$ with the property

$$p_{2n+1}(x_j) = y_j, \quad p_{2n+1}'(x_j) = y_j', \quad j = 0, \ldots, n. \tag{8.10}$$

This Hermite interpolation polynomial is given by

$$p_{2n+1} = \sum_{k=0}^{n} \bigl[ y_k H_k^0 + y_k' H_k^1 \bigr] \tag{8.11}$$

with the Hermite factors

$$H_k^0(x) := [1 - 2\ell_k'(x_k)(x - x_k)]\, [\ell_k(x)]^2, \qquad H_k^1(x) := (x - x_k)\, [\ell_k(x)]^2$$

expressed in terms of the Lagrange factors from Theorem 8.3.

Proof. Obviously, the polynomial $p_{2n+1}$ belongs to $P_{2n+1}$, since the Hermite
factors have degree $2n + 1$. From (8.3), by elementary calculations it can
be seen that (see Problem 8.7)

$$\begin{aligned}
H_k^0(x_j) &= H_k^{1\prime}(x_j) = \delta_{jk}, \\
H_k^{0\prime}(x_j) &= H_k^1(x_j) = 0,
\end{aligned} \qquad j, k = 0, \ldots, n. \tag{8.12}$$

From this it follows that the polynomial (8.11) satisfies the Hermite
interpolation property (8.10).

To prove uniqueness of the Hermite interpolation polynomial we assume
that $p_{2n+1,1}, p_{2n+1,2} \in P_{2n+1}$ are two polynomials having the interpolation
property (8.10). Then the difference $p_{2n+1} := p_{2n+1,1} - p_{2n+1,2}$ satisfies

$$p_{2n+1}(x_j) = p_{2n+1}'(x_j) = 0, \quad j = 0, \ldots, n;$$

i.e., the polynomial $p_{2n+1} \in P_{2n+1}$ has $n + 1$ zeros of order two and
therefore, by Theorem 8.1, must be identically equal to zero. This implies that
$p_{2n+1,1} = p_{2n+1,2}$. □
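
A direct evaluation of (8.11) needs only the Lagrange factors and the numbers $\ell_k'(x_k) = \sum_{i \neq k} 1/(x_k - x_i)$, which follow from differentiating the product defining $\ell_k$. The sketch below is illustrative; the function name and the test data (values and derivatives of $x^3$ at the points 0 and 1) are assumptions chosen so that the result can be checked by hand.

```python
from fractions import Fraction

def hermite_interpolate(xs, ys, dys, x):
    """Evaluate the Hermite interpolation polynomial (8.11) at x."""
    total = 0
    for k, xk in enumerate(xs):
        lk = 1                 # Lagrange factor l_k(x)
        dlk = 0                # l_k'(x_k) = sum over i != k of 1 / (x_k - x_i)
        for i, xi in enumerate(xs):
            if i != k:
                lk *= (x - xi) / (xk - xi)
                dlk += 1 / (xk - xi)
        h0 = (1 - 2 * dlk * (x - xk)) * lk ** 2     # Hermite factor H_k^0(x)
        h1 = (x - xk) * lk ** 2                     # Hermite factor H_k^1(x)
        total += ys[k] * h0 + dys[k] * h1
    return total

xs = [Fraction(0), Fraction(1)]
ys = [Fraction(0), Fraction(1)]      # values of x^3 at the points
dys = [Fraction(0), Fraction(3)]     # derivatives 3x^2 at the points
value = hermite_interpolate(xs, ys, dys, Fraction(1, 2))
```

Since $x^3$ belongs to $P_3$, uniqueness implies that the Hermite polynomial reproduces it, so the value at $1/2$ is $1/8$.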

The main application of Hermite interpolation consists in the
approximation of a given function $f \in C^1[a, b]$ by interpolating its function values
and the values of its derivative at $n + 1$ distinct points $x_0, \ldots, x_n \in [a, b]$.
By

$$H_n : C^1[a, b] \to P_{2n+1}$$

we denote the Hermite interpolation operator that maps continuously
differentiable functions $f : [a, b] \to \mathbb{R}$ into the uniquely determined Hermite
interpolation polynomial $H_n f \in P_{2n+1}$ with the property

$$(H_n f)(x_j) = f(x_j), \quad (H_n f)'(x_j) = f'(x_j), \quad j = 0, \ldots, n.$$

The following theorem can be proven analogously to Theorem 8.10 (see
Problem 8.8).

Theorem 8.19 Let $f : [a, b] \to \mathbb{R}$ be $(2n + 2)$-times continuously
differentiable. Then the remainder $R_n f := f - H_n f$ for Hermite interpolation
with $n + 1$ distinct points $x_0, \ldots, x_n \in [a, b]$ can be represented in the form

$$(R_n f)(x) = \frac{f^{(2n+2)}(\xi)}{(2n + 2)!} \prod_{j=0}^{n} (x - x_j)^2, \quad x \in [a, b], \tag{8.13}$$

for some $\xi \in [a, b]$ depending on $x$.



8.2 Trigonometric Interpolation 

In applications, quite frequently there occur periodic functions, i.e.,
functions with the property

$$f(t + T) = f(t), \quad t \in \mathbb{R},$$

for some $T > 0$. For example, functions defined on closed planar or spatial
curves always may be viewed as periodic functions. Polynomial interpolation
is not appropriate for periodic functions, since algebraic polynomials
are not periodic. Therefore, we proceed by considering interpolation by
trigonometric polynomials, which was first used independently by Clairaut
(1759) and Lagrange (1762). Without loss of generality we assume that the
period is equal to $T = 2\pi$.

Definition 8.20 For $n \in \mathbb{N}$ we denote by $T_n$ the linear space of
trigonometric polynomials

$$q(t) = \sum_{k=0}^{n} a_k \cos kt + \sum_{k=1}^{n} b_k \sin kt$$

with real (or complex) coefficients $a_0, \ldots, a_n$ and $b_1, \ldots, b_n$. A trigonometric
polynomial $q \in T_n$ is said to be of degree $n$ if $|a_n| + |b_n| > 0$.

From the addition theorems for the cosine and sine functions it follows
that $q_1 q_2 \in T_{n_1 + n_2}$ if $q_1 \in T_{n_1}$ and $q_2 \in T_{n_2}$. This justifies speaking of
trigonometric polynomials.

Theorem 8.21 A trigonometric polynomial in $T_n$ that has more than $2n$
distinct zeros in the periodicity interval $[0, 2\pi)$ must vanish identically; i.e.,
all its coefficients must be equal to zero.

Proof. We consider a trigonometric polynomial $q \in T_n$ of the form

$$q(t) = \frac{a_0}{2} + \sum_{k=1}^{n} [a_k \cos kt + b_k \sin kt]. \tag{8.14}$$

Setting $b_0 = 0$,

$$\gamma_k := \frac{1}{2}\, (a_k - i b_k), \quad \gamma_{-k} := \frac{1}{2}\, (a_k + i b_k), \quad k = 0, \ldots, n, \tag{8.15}$$

and using Euler's formula

$$e^{it} = \cos t + i \sin t,$$

we can rewrite (8.14) in the complex form

$$q(t) = \sum_{k=-n}^{n} \gamma_k e^{ikt}. \tag{8.16}$$

Therefore, substituting $z = e^{it}$ and setting

$$p(z) := \sum_{k=-n}^{n} \gamma_k z^{n+k},$$

we have the relation

$$q(t) = z^{-n} p(z).$$

Now assume that the trigonometric polynomial $q \in T_n$ has more than $2n$
distinct zeros in the interval $[0, 2\pi)$. Then the algebraic polynomial $p \in P_{2n}$
has more than $2n$ distinct zeros lying on the unit circle in the complex plane,
since the function $t \mapsto e^{it}$ maps $[0, 2\pi)$ bijectively onto the unit circle. By
Theorem 8.1, the algebraic polynomial $p$ must be identically zero, and now
(8.15) implies that also $q$ must be identically zero. □

Theorem 8.22 The cosine functions $c_k(t) := \cos kt$, $k = 0, 1, \ldots, n$, and
the sine functions $s_k(t) := \sin kt$, $k = 1, \ldots, n$, are linearly independent in
the function space $C[0, 2\pi]$.

Proof. To prove this, assume that

$$\sum_{k=0}^{n} a_k c_k + \sum_{k=1}^{n} b_k s_k = 0,$$

that is,

$$\sum_{k=0}^{n} a_k \cos kt + \sum_{k=1}^{n} b_k \sin kt = 0, \quad t \in [0, 2\pi].$$

Then the trigonometric polynomial with coefficients $a_0, \ldots, a_n$ and $b_1, \ldots, b_n$
has more than $2n$ distinct zeros in $[0, 2\pi)$, and from Theorem 8.21 it follows
that all the coefficients must be zero. Note that this linear independence
also can be deduced from Theorem 3.17. □

Theorem 8.22 implies that the cosines $c_k$, $k = 0, 1, \ldots, n$, and sines
$s_k$, $k = 1, \ldots, n$, form a basis for $T_n$ and that $T_n$ has dimension $2n + 1$.

Theorem 8.23 Given $2n + 1$ distinct points $t_0, \ldots, t_{2n} \in [0, 2\pi)$ and $2n + 1$
values $y_0, \ldots, y_{2n} \in \mathbb{R}$, there exists a uniquely determined trigonometric
polynomial $q_n \in T_n$ with the property

$$q_n(t_j) = y_j, \quad j = 0, \ldots, 2n. \tag{8.17}$$

In the Lagrange representation, this trigonometric interpolation polynomial
is given by

$$q_n = \sum_{k=0}^{2n} y_k \ell_k \tag{8.18}$$

with the Lagrange factors

$$\ell_k(t) = \prod_{\substack{i=0 \\ i \neq k}}^{2n} \frac{\sin \dfrac{t - t_i}{2}}{\sin \dfrac{t_k - t_i}{2}}, \quad k = 0, \ldots, 2n.$$

Proof. The function $q_n$ belongs to $T_n$, since the Lagrange factors are
trigonometric polynomials of degree $n$. The latter is a consequence of

$$\sin \frac{t - t_0}{2}\, \sin \frac{t - t_1}{2} = \frac{1}{2} \cos \frac{t_1 - t_0}{2} - \frac{1}{2} \cos \Bigl( t - \frac{t_0 + t_1}{2} \Bigr);$$

i.e., each of the functions $\ell_k$ is a product of $n$ trigonometric polynomials of
degree one. As in Theorem 8.3, we have $\ell_k(t_j) = \delta_{jk}$ for $j, k = 0, \ldots, 2n$,
which shows that $q_n$ indeed solves the trigonometric interpolation problem.

Uniqueness of the trigonometric interpolation polynomial follows
analogously to the proof of Theorem 8.3 with the aid of Theorem 8.21. □
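
The trigonometric Lagrange representation (8.18) can be evaluated in the same way as its algebraic counterpart. In the sketch below the function names, the three sample nodes, and the test polynomial $q(t) = 1 + \cos t + \tfrac12 \sin t \in T_1$ are hypothetical choices; by uniqueness the interpolant must reproduce $q$.

```python
import math

def trig_lagrange_interpolate(ts, ys, t):
    """Evaluate (8.18) at t for 2n + 1 distinct nodes ts in [0, 2*pi)."""
    total = 0.0
    for k, tk in enumerate(ts):
        factor = 1.0                      # Lagrange factor l_k(t)
        for i, ti in enumerate(ts):
            if i != k:
                factor *= math.sin((t - ti) / 2) / math.sin((tk - ti) / 2)
        total += ys[k] * factor
    return total

def q(t):
    """Sample trigonometric polynomial of degree one."""
    return 1.0 + math.cos(t) + 0.5 * math.sin(t)

ts = [0.3, 1.7, 4.0]                      # 2n + 1 = 3 nodes, not equidistant
ys = [q(t) for t in ts]
value = trig_lagrange_interpolate(ts, ys, 2.5)
```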



We now consider the important case of an equidistant subdivision

$$t_j = \frac{2\pi j}{2n + 1}, \quad j = 0, \ldots, 2n.$$

For this we first note the summation formula

$$\sum_{j=0}^{2n} e^{ikt_j} = \begin{cases} 2n + 1, & k = 0, \\ 0, & k = \pm 1, \ldots, \pm 2n, \end{cases} \tag{8.19}$$

which is a consequence of the fact that for $e^{it_k} \neq 1$ we have the geometric
sum

$$\sum_{j=0}^{2n} e^{ijt_k} = \frac{1 - e^{i(2n+1)t_k}}{1 - e^{it_k}} = 0,$$

whereas for $e^{it_k} = 1$ each term in the sum is equal to one.

We now attempt to find the uniquely determined interpolation polynomial
in the complex form

$$q_n(t) = \sum_{k=-n}^{n} \gamma_k e^{ikt}.$$

From the interpolation conditions

$$q_n(t_j) = y_j, \quad j = 0, \ldots, 2n,$$

we observe that solving the interpolation problem is equivalent to solving
the system of linear equations

$$\sum_{k=-n}^{n} \gamma_k e^{ikt_j} = y_j, \quad j = 0, \ldots, 2n. \tag{8.20}$$

Assume that the coefficients $\gamma_k$ solve (8.20). Then, with the aid of (8.19),
we obtain

$$\sum_{j=0}^{2n} y_j e^{-imt_j} = \sum_{k=-n}^{n} \gamma_k \sum_{j=0}^{2n} e^{i(k-m)t_j} = (2n + 1)\, \gamma_m;$$

i.e., any solution of (8.20) must be of the form

$$\gamma_k = \frac{1}{2n + 1} \sum_{j=0}^{2n} y_j e^{-ikt_j}, \quad k = -n, \ldots, n. \tag{8.21}$$

On the other hand, again with the aid of (8.19), for $\gamma_k$ given by (8.21) we
have that

$$\sum_{k=-n}^{n} \gamma_k e^{ikt_j} = \frac{1}{2n + 1} \sum_{m=0}^{2n} y_m \sum_{k=0}^{2n} e^{i(k-n)(t_j - t_m)} = y_j, \quad j = 0, \ldots, 2n;$$

i.e., the linear system (8.20) has a unique solution, which is given by (8.21).
From this, using the relation (8.15) between the real representation (8.14)
and the complex representation (8.16) of trigonometric polynomials, we
derive the following theorem.

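Theorem 8.24 There exists a unique trigonometric polynomial

$$q_n(t) = \frac{a_0}{2} + \sum_{k=1}^{n} [a_k \cos kt + b_k \sin kt]$$

satisfying the interpolation property

$$q_n\Bigl( \frac{2\pi j}{2n + 1} \Bigr) = y_j, \quad j = 0, \ldots, 2n.$$

Its coefficients are given by

$$a_k = \frac{2}{2n + 1} \sum_{j=0}^{2n} y_j \cos \frac{2\pi jk}{2n + 1}, \quad k = 0, \ldots, n,$$

$$b_k = \frac{2}{2n + 1} \sum_{j=0}^{2n} y_j \sin \frac{2\pi jk}{2n + 1}, \quad k = 1, \ldots, n.$$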
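
The coefficient formulas of Theorem 8.24 can be checked numerically by computing $a_k$ and $b_k$ from given data and evaluating the resulting polynomial at the nodes. This is a minimal sketch with $O(n^2)$ cost; the function names and the five sample values are assumptions made for illustration.

```python
import math

def trig_coefficients(ys):
    """Return (a, b) with a[0..n] and b[1..n] from Theorem 8.24 (b[0] = 0)."""
    m = len(ys)                            # m = 2n + 1 data values
    n = (m - 1) // 2
    a = [2.0 / m * sum(y * math.cos(2 * math.pi * j * k / m)
                       for j, y in enumerate(ys)) for k in range(n + 1)]
    b = [0.0] + [2.0 / m * sum(y * math.sin(2 * math.pi * j * k / m)
                               for j, y in enumerate(ys)) for k in range(1, n + 1)]
    return a, b

def trig_evaluate(a, b, t):
    """Evaluate a_0/2 + sum of a_k cos(kt) + b_k sin(kt) for k = 1..n."""
    return a[0] / 2 + sum(a[k] * math.cos(k * t) + b[k] * math.sin(k * t)
                          for k in range(1, len(a)))

ys = [1.0, -2.0, 0.5, 3.0, 4.0]            # arbitrary sample data, n = 2
a, b = trig_coefficients(ys)
nodes = [2 * math.pi * j / len(ys) for j in range(len(ys))]
residual = max(abs(trig_evaluate(a, b, t) - y) for t, y in zip(nodes, ys))
```

The constant term $a_0/2$ equals the mean of the data, and the residual at the nodes is at the level of rounding errors.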



For an equidistant subdivision with an even number $2n$ of interpolation
points

$$t_j = \frac{\pi j}{n}, \quad j = 0, \ldots, 2n - 1,$$

we have only $2n$ conditions to determine an element of the $(2n + 1)$-dimensional
space $T_n$. However, since the function $\sin nt$ obviously has its zeros
at the interpolation points, we drop it from the interpolation polynomial.
The proof of the following theorem is completely analogous to the proof of
Theorem 8.24.

$$b_k = \frac{1}{n} \sum_{j=0}^{2n-1} y_j \sin \frac{\pi jk}{n}, \quad k = 1, \ldots, n - 1.$$

Obviously, the trigonometric interpolation polynomials of Theorems 8.24 
and 8.25 may be viewed as discretized versions of the Fourier series, where 
the integrals giving the coefficients of the Fourier series (see Problem 3.20) 
are approximated by the rectangular quadrature rule at an equidistant grid 
(see Corollary 9.27). Therefore, trigonometric interpolation on an equidis- 
tant grid is also known as the discrete Fourier transform. 

An effective numerical evaluation of trigonometric polynomials can be
done analogously to the Horner scheme for algebraic polynomials. For the
polynomial

$$p(z) = \sum_{k=0}^{n} c_k z^k$$

the recursion (6.10) of the Horner scheme has the form

$$b_{k-1} = b_k z + c_{k-1}, \quad k = n,\dots,1,$$

starting with $b_n = c_n$, and it delivers $p(z) = b_0$. Assuming that the coeffi-
cients $c_k$ are real, we substitute $z = e^{it}$ and separate into real and imaginary
parts, $b_k = u_k + iv_k$, to obtain $u_n = c_n$, $v_n = 0$, and the recursion

$$u_{k-1} = u_k \cos t - v_k \sin t + c_{k-1}, \quad v_{k-1} = u_k \sin t + v_k \cos t,$$

for $k = n, n-1,\dots,1$. From this we find

$$u_0 = \sum_{k=0}^{n} c_k \cos kt, \quad v_0 = \sum_{k=1}^{n} c_k \sin kt;$$

i.e., the evaluation of a trigonometric polynomial at a point $t$ can be reduced
to the evaluation of $\sin t$ and $\cos t$ and $O(n)$ additions and multiplications.
To compute all the coefficients $a_k$ and $b_k$ in Theorem 8.24 or 8.25 by this
approach requires $O(n^2)$ additions and multiplications.
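
The Horner-type recursion for $u_k$ and $v_k$ can be sketched as follows. This
is a small Python illustration under the assumption that the real coefficients
are stored as a list `c = [c_0, ..., c_n]`; the name `trig_horner` is
hypothetical.

```python
import math

def trig_horner(c, t):
    """Return (u_0, v_0) = (sum_k c_k cos kt, sum_k c_k sin kt) for real c_k."""
    ct, st = math.cos(t), math.sin(t)  # the only trigonometric evaluations
    u, v = c[-1], 0.0                  # u_n = c_n, v_n = 0
    for k in range(len(c) - 1, 0, -1):
        # u_{k-1} = u_k cos t - v_k sin t + c_{k-1},  v_{k-1} = u_k sin t + v_k cos t
        u, v = u * ct - v * st + c[k - 1], u * st + v * ct
    return u, v

# Compare with the direct sums for some sample coefficients
c = [0.5, -1.0, 2.0, 0.25]
t = 0.7
u0, v0 = trig_horner(c, t)
u_direct = sum(ck * math.cos(k * t) for k, ck in enumerate(c))
v_direct = sum(ck * math.sin(k * t) for k, ck in enumerate(c))
```

The loop performs a fixed number of multiplications and additions per
coefficient, in line with the $O(n)$ count stated above.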

By the fast Fourier transform, which is attributed to Cooley and Tukey
(1965) and which was known already to Gauss, the computational costs
can be reduced even further. The main idea is to exploit the symmetries of
$e^{2\pi ij/n}$ if $n$ is a power of two, say $n = 2^p$ with $p \in \mathbb{N}$. We briefly explain the
fast Fourier transform algorithm for the evaluation of the discrete Fourier
transform in the complex form

$$c_k = \frac{1}{n} \sum_{j=0}^{n-1} y_j e^{-2\pi ijk/n}, \quad k = 0,\dots,n-1. \tag{8.22}$$

Let $m := n/2 = 2^{p-1}$ and $\omega := e^{-2\pi i/n}$. Then $\omega^n = 1$, $\omega^m = -1$, and
(8.22) reads

$$c_k = \frac{1}{n} \sum_{j=0}^{n-1} y_j \omega^{jk}, \quad k = 0,\dots,n-1.$$

Now, the basic idea of the Cooley-Tukey algorithm is to break this sum
into two parts for $j$ even and $j$ odd; i.e.,

$$c_k = \frac{1}{2}\gamma_k + \frac{1}{2}\omega^k \delta_k, \quad k = 0,\dots,n-1,$$

where

$$\gamma_k := \frac{1}{m} \sum_{j=0}^{m-1} y_{2j}\,\omega^{2jk}, \quad \delta_k := \frac{1}{m} \sum_{j=0}^{m-1} y_{2j+1}\,\omega^{2jk}, \quad k = 0,\dots,n-1.$$

Since $\omega^{2m} = 1$, we have $\gamma_{k+m} = \gamma_k$ and $\delta_{k+m} = \delta_k$, and therefore

$$c_k = \frac{1}{2}\gamma_k + \frac{1}{2}\omega^k \delta_k, \quad c_{k+m} = \frac{1}{2}\gamma_k - \frac{1}{2}\omega^k \delta_k, \quad k = 0,\dots,m-1.$$

Obviously, the $\gamma_k$, $\delta_k$, $k = 0,\dots,m-1$, represent a discrete Fourier trans-
form of length $m = n/2$. Hence, the discrete Fourier transform of length
$n$ is reduced to two discrete Fourier transforms of length $n/2$ followed by
$n$ multiplications and $n$ additions. If this is done recursively, we arrive at
the following operation count. Assuming that the $\omega^k$, $k = 1,\dots,m-1$, are
precomputed, let $M_p$ denote the number of additions and multiplications
needed for the Fourier transform of length $n = 2^p$. Then,

$$M_p = 2M_{p-1} + 2^{p+1}$$

with $M_0 = 0$. From this, by induction, it follows that

$$M_p = p\,2^{p+1} = 2n \log_2 n,$$

i.e., that the computational cost is reduced significantly from order $O(n^2)$
to order $O(n \log_2 n)$.
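
The even/odd splitting can be written as a short recursive routine. The
following Python sketch is only meant to mirror the formulas for $c_k$ and
$c_{k+m}$ above, with the normalization $1/n$ of (8.22); it recomputes the
powers $\omega^k$ instead of precomputing them, and the names `dft` and `fft`
are illustrative choices.

```python
import cmath

def dft(y):
    """Direct evaluation of (8.22): c_k = (1/n) sum_s y_s exp(-2 pi i s k / n)."""
    n = len(y)
    return [sum(y[s] * cmath.exp(-2j * cmath.pi * s * k / n) for s in range(n)) / n
            for k in range(n)]

def fft(y):
    """Recursive even/odd splitting for n a power of two, same scaling as dft."""
    n = len(y)
    if n == 1:
        return [complex(y[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    m = n // 2
    gamma = fft(y[0::2])  # transform of the even-indexed samples (length m)
    delta = fft(y[1::2])  # transform of the odd-indexed samples (length m)
    c = [0j] * n
    for k in range(m):
        w = cmath.exp(-2j * cmath.pi * k / n) * delta[k]  # omega^k * delta_k
        c[k] = 0.5 * (gamma[k] + w)
        c[k + m] = 0.5 * (gamma[k] - w)
    return c

# Sample data of length n = 2^3
y = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 4.0]
difference = max(abs(u - v) for u, v in zip(fft(y), dft(y)))
```

Practical implementations avoid the recursion and the repeated list slicing;
the bit-reversal ordering described next serves that purpose.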

The actual numerical implementation is based on writing the indices $k$
and $j$ in a binary representation

$$k = [k_0,\dots,k_{p-1}] = \sum_{q=0}^{p-1} k_q 2^q, \quad j = [j_0,\dots,j_{p-1}] = \sum_{q=0}^{p-1} j_q 2^q$$

with $k_q, j_q \in \{0,1\}$ for $q = 0,\dots,p-1$. Then

$$e^{-2\pi ijk/n} = \prod_{q=0}^{p-1} e^{-\frac{2\pi i j_q}{2^{p-q}}[k_0,\dots,k_{p-q-1}]},$$

since

$$e^{-2\pi i j_q k_r 2^{q+r-p}} = 1$$

for $q + r \ge p$. Inserting this into (8.22), we can split the long sum into $p$
nested short sums, and the Fourier transform becomes

$$c_k = \frac{1}{n} \sum_{j_0=0}^{1} e^{-\frac{2\pi i j_0}{2^p}[k_0,\dots,k_{p-1}]} \times \cdots \times \sum_{j_{p-1}=0}^{1} e^{-\frac{2\pi i j_{p-1}}{2} k_0}\, y_{[j_0,\dots,j_{p-1}]}.$$

Define the intermediate sums

$$S^q_{[j_0,\dots,j_{p-q-1},k_{q-1},\dots,k_0]} := \sum_{j_{p-q}=0}^{1} e^{-\frac{2\pi i j_{p-q}}{2^q}[k_0,\dots,k_{q-1}]} \times \cdots \times \sum_{j_{p-1}=0}^{1} e^{-\frac{2\pi i j_{p-1}}{2} k_0}\, y_{[j_0,\dots,j_{p-1}]}$$

for $q = 1,\dots,p$ and $j_0,\dots,j_{p-q-1},k_{q-1},\dots,k_0 \in \{0,1\}$. Then clearly,

$$c_{[k_0,\dots,k_{p-1}]} = \frac{1}{n}\, S^p_{[k_{p-1},\dots,k_0]}, \tag{8.23}$$

and setting

$$S^0_{[j_0,\dots,j_{p-1}]} := y_{[j_0,\dots,j_{p-1}]},$$

we have the recursive relation

$$S^q_{[j_0,\dots,j_{p-q-1},k_{q-1},\dots,k_0]} = S^{q-1}_{[j_0,\dots,j_{p-q-1},0,k_{q-2},\dots,k_0]} + S^{q-1}_{[j_0,\dots,j_{p-q-1},1,k_{q-2},\dots,k_0]}\, e^{-\frac{2\pi i}{2^q}[k_0,\dots,k_{q-1}]}$$

for $q = 1,\dots,p$. Each step of these $p$ recursions requires $n$ additions and
$n$ multiplications. Hence, the total computational cost is indeed of order
$O(n \log_2 n)$. For more details of the actual numerical implementation and,
in particular, on how to effectively perform the so-called bit reversal in
order to arrange the result (8.23) in the natural order, we refer to [45].

The error analysis for trigonometric interpolation is more complicated
than the error analysis of the previous section for polynomial interpolation.
Denote by $L_n : C[0,2\pi] \to T_n$ the trigonometric interpolation operator
that maps the function $f$ onto its trigonometric interpolation polynomial
$L_n f$. For equidistant grids, by Problems 8.12 and 8.13 we have convergence
$\|L_n f - f\|_2 \to 0$, $n \to \infty$, for each continuous $2\pi$-periodic function $f$ and
$\|L_n f - f\|_\infty \to 0$, $n \to \infty$, for each continuously differentiable $2\pi$-periodic
function $f$. For a detailed error analysis we refer to [49].



8.3 Spline Interpolation 



As we have seen in our considerations of the convergence of interpolation 
polynomials, increasing the number of interpolation points, i.e., increasing 
the degree of the polynomials, does not always lead to an improvement 
in the approximation. The spline interpolation that we will study in this 
section remedies this deficiency of interpolation by high-degree polynomials 
through a piecewise polynomial interpolation of low degree. 

A frequently used method of this type is piecewise linear interpolation.
Let $a = x_0 < x_1 < \cdots < x_n = b$ be a subdivision of the interval $[a,b]$.
Then a given function $f \in C[a,b]$ can be approximated by a continuous
piecewise linear function by linear interpolation on each of the subintervals,
i.e., according to Example 8.12, by

$$s_n(x) = \frac{1}{x_j - x_{j-1}}\,[f(x_{j-1})(x_j - x) + f(x_j)(x - x_{j-1})], \quad x \in [x_{j-1}, x_j].$$

From the error estimate (8.9) for linear interpolation, we see that for piece-
wise linear interpolation we have uniform convergence $\|s_n - f\|_\infty \to 0$
for $n \to \infty$ on $[a,b]$, provided that $h := \max_{j=1,\dots,n} |x_j - x_{j-1}| \to 0$ and
$f \in C^2[a,b]$. The main advantage of this method is its simplicity and its
stability with respect to errors in the interpolation values. However, since
by (8.9) linear interpolation has an error only of order $O(h^2)$, for achieving
a prescribed accuracy it usually requires a much finer discretization than
some of the higher-order methods described below.
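
The $O(h^2)$ behavior can be observed numerically. The sketch below is an
illustration in Python; the test function $\sin x$ on $[0,\pi]$, the sampling
grid, and the names `piecewise_linear` and `max_error` are assumptions made
for this example only. It evaluates $s_n$ by the formula above and measures
the maximum error for two step sizes.

```python
import math
from bisect import bisect_right

def piecewise_linear(xs, fs, x):
    """Evaluate s_n(x) for nodes xs (increasing) and values fs = f(xs)."""
    # Index j of the subinterval [x_{j-1}, x_j] containing x, clamped to 1..n
    j = min(max(bisect_right(xs, x), 1), len(xs) - 1)
    x0, x1 = xs[j - 1], xs[j]
    return (fs[j - 1] * (x1 - x) + fs[j] * (x - x0)) / (x1 - x0)

def max_error(n):
    """Maximum sampled error for f = sin on [0, pi] with n equal subintervals."""
    xs = [math.pi * j / n for j in range(n + 1)]
    fs = [math.sin(x) for x in xs]
    samples = [math.pi * i / 1000 for i in range(1001)]
    return max(abs(piecewise_linear(xs, fs, x) - math.sin(x)) for x in samples)

e10, e20 = max_error(10), max_error(20)
ratio = e10 / e20  # halving h should reduce the error by roughly a factor 4
```

For this smooth test function the sampled error for $n = 10$ stays below
$h^2/8$ with $h = \pi/10$, which is the standard bound for linear
interpolation when $\|f''\|_\infty = 1$.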

Definition 8.26 Let $a = x_0 < x_1 < \cdots < x_n = b$ be a subdivision of
the interval $[a,b]$ and $m \in \mathbb{N}$. A function $s : [a,b] \to \mathbb{R}$ is called a spline
of degree $m$ with respect to this subdivision if $s$ is $(m-1)$-times continu-
ously differentiable on $[a,b]$ and if the restriction of $s$ to each subinterval
$[x_{j-1}, x_j]$ for $j = 1,\dots,n$ reduces to a polynomial of degree at most $m$. By
$S_m^n$ we denote the set of all splines of degree $m$ for a fixed subdivision.

Although piecewise polynomials have been studied since the beginning of
this century, the notation spline was introduced only in 1946 by Schoenberg.
The term originates from the thin wooden or metal strips that were used
by draftsmen to fit a smooth curve between specified points. Since small
displacements $s$ of a thin elastic beam are governed by the fourth-order
differential equation $s^{(4)} = 0$, cubic splines, i.e., splines of degree three,
indeed model the draftsmen's splines.

Theorem 8.27 $S_m^n$ is a linear space of dimension $m + n$.

Proof. Clearly, $S_m^n$ is a linear space, since $C^{m-1}[a,b]$ and $P_m$ are linear
spaces. In the sequel we shall use the notation

$$x_+^m := \begin{cases} x^m, & x \ge 0, \\ 0, & x < 0, \end{cases}$$

for $m \in \mathbb{N}$. The $m + n$ functions

$$u_k(x) := (x - x_0)^k, \quad k = 0,\dots,m,$$

$$v_k(x) := (x - x_k)_+^m, \quad k = 1,\dots,n-1,$$

are linearly independent. In order to see this, let

$$\sum_{k=0}^{m} \alpha_k u_k + \sum_{k=1}^{n-1} \beta_k v_k = 0. \tag{8.24}$$

Then, in particular,

$$\sum_{k=0}^{m} \alpha_k (x - x_0)^k = 0, \quad x \in [x_0, x_1],$$

whence $\alpha_k = 0$ for $k = 0,\dots,m$. Then we have

$$\beta_1 (x - x_1)^m = 0, \quad x \in [x_1, x_2],$$

and therefore $\beta_1 = 0$. Repeating this argument inductively, it follows that
$\beta_k = 0$, $k = 1,\dots,n-1$.

To complete the proof, we need to show that each $s \in S_m^n$ can be ex-
pressed as a linear combination of the functions (8.24). Given a spline
$s \in S_m^n$, by induction we show that there exist constants $\alpha_0,\dots,\alpha_m$ and
$\beta_1,\dots,\beta_{n-1}$ such that

$$s(x) = \sum_{k=0}^{m} \alpha_k (x - x_0)^k + \sum_{k=1}^{j-1} \beta_k (x - x_k)_+^m, \quad x \in [x_0, x_j], \tag{8.25}$$

for $j = 1,\dots,n$. This is true for $j = 1$, since on $[x_0, x_1]$ the spline $s$ coincides
with an element of $P_m$. Now assume that we have the representation (8.25)
for some $j \ge 1$. Then the difference

$$p(x) := s(x) - \sum_{k=0}^{m} \alpha_k (x - x_0)^k - \sum_{k=1}^{j-1} \beta_k (x - x_k)_+^m$$

restricted to the interval $[x_j, x_{j+1}]$ is in $P_m$. Since the spline $s$ is in $C^{m-1}[a,b]$
and $p$ vanishes on $[x_0, x_j]$ we have that

$$p^{(i)}(x_j) = 0, \quad i = 0,\dots,m-1.$$

Hence $p(x) = \beta_j (x - x_j)^m$ on $[x_j, x_{j+1}]$ for some constant $\beta_j$, and because
$(x - x_j)_+^m = 0$ on $[x_0, x_j]$, the representation (8.25) is proven for $j + 1$. □



Since the spline space $S_m^n$ has dimension $m + n$, the $n + 1$ interpolation
conditions at the points $x_0,\dots,x_n$ are not sufficient to determine uniquely
a spline of degree greater than one. Therefore, we need to add additional
requirements in the form of conditions at the two endpoints $x_0 = a$ and
$x_n = b$ of the interval. Since we want to divide the number of these end
conditions equally between both ends, we consider only odd degrees $m$.

Lemma 8.28 Let $m = 2\ell - 1$ with $\ell \in \mathbb{N}$ and $\ell \ge 2$, and let $f \in C^\ell[a,b]$.
Assume that the spline $s \in S_m^n$ interpolates $f$, i.e.,

$$s(x_j) = f(x_j), \quad j = 0,\dots,n, \tag{8.26}$$

and that it satisfies the boundary conditions

$$s^{(j)}(a) = f^{(j)}(a), \quad s^{(j)}(b) = f^{(j)}(b), \quad j = 1,\dots,\ell-1. \tag{8.27}$$

Then

$$\int_a^b [f^{(\ell)}(x) - s^{(\ell)}(x)]^2\,dx = \int_a^b [f^{(\ell)}(x)]^2\,dx - \int_a^b [s^{(\ell)}(x)]^2\,dx. \tag{8.28}$$



Proof. We have that

$$\int_a^b [f^{(\ell)}(x) - s^{(\ell)}(x)]^2\,dx = \int_a^b [f^{(\ell)}(x)]^2\,dx - \int_a^b [s^{(\ell)}(x)]^2\,dx - 2R,$$

where

$$R := \int_a^b [f^{(\ell)}(x) - s^{(\ell)}(x)]\,s^{(\ell)}(x)\,dx.$$

Since $f \in C^\ell[a,b]$ and $s \in C^{m-1}[a,b]$ has piecewise continuous derivatives
of order $m$, by $\ell - 1$ repeated partial integrations and using the boundary
conditions (8.27) we obtain that

$$R = (-1)^{\ell-1} \int_a^b [f'(x) - s'(x)]\,s^{(m)}(x)\,dx.$$

A further partial integration and the interpolation conditions now yield

$$R = (-1)^{\ell-1} \sum_{j=1}^{n} \int_{x_{j-1}}^{x_j} [f'(x) - s'(x)]\,s^{(m)}(x)\,dx = (-1)^{\ell-1} \sum_{j=1}^{n} [f(x) - s(x)]\,s^{(m)}(x)\,\Big|_{x_{j-1}}^{x_j} = 0,$$

since $s^{(m+1)} = 0$ on each subinterval. This completes the proof. □



Lemma 8.29 Under the assumptions of Lemma 8.28 let $f = 0$. Then
$s = 0$.

Proof. For $f = 0$, from (8.28) it follows that

$$\int_a^b [s^{(\ell)}(x)]^2\,dx = 0.$$

This implies that $s^{(\ell)} = 0$, and therefore $s \in P_{\ell-1}$ on $[a,b]$. Now the bound-
ary conditions $s^{(j)}(a) = 0$, $j = 0,\dots,\ell-1$, yield $s = 0$. □

From the proof it can be seen that Lemmas 8.28 and 8.29 remain valid
if the boundary conditions (8.27) are replaced by

$$s^{(\ell+j)}(a) = s^{(\ell+j)}(b) = 0, \quad j = 0,\dots,\ell-2,$$

or, provided that $f$ is periodic with period $b - a$, by the periodicity condition

$$s^{(j)}(a) = s^{(j)}(b), \quad j = 1,\dots,\ell-1.$$

Consequently, the following conclusions drawn from Lemma 8.29 are also
true for these two end conditions. However, from a practical point of view
only the latter modification is of relevance.

Theorem 8.30 Let $m = 2\ell - 1$ with $\ell \in \mathbb{N}$ and $\ell \ge 2$. Then, given $n + 1$
values $y_0,\dots,y_n$ and $m - 1$ boundary data $a_1,\dots,a_{\ell-1}$ and $b_1,\dots,b_{\ell-1}$,
there exists a unique spline $s \in S_m^n$ satisfying the interpolation conditions

$$s(x_j) = y_j, \quad j = 0,\dots,n, \tag{8.29}$$

and the boundary conditions

$$s^{(j)}(a) = a_j, \quad s^{(j)}(b) = b_j, \quad j = 1,\dots,\ell-1. \tag{8.30}$$

Proof. Representing the spline in the form (8.25), i.e.,

$$s(x) = \sum_{k=0}^{m} \alpha_k u_k(x) + \sum_{k=1}^{n-1} \beta_k v_k(x), \tag{8.31}$$

it follows that the interpolation conditions (8.29) and boundary conditions
(8.30) are satisfied if and only if the $m + n$ coefficients $\alpha_0,\dots,\alpha_m$ and
$\beta_1,\dots,\beta_{n-1}$ solve the system

$$\sum_{k=0}^{m} \alpha_k u_k(x_j) + \sum_{k=1}^{n-1} \beta_k v_k(x_j) = y_j, \quad j = 0,\dots,n,$$

$$\sum_{k=0}^{m} \alpha_k u_k^{(j)}(a) + \sum_{k=1}^{n-1} \beta_k v_k^{(j)}(a) = a_j, \quad j = 1,\dots,\ell-1, \tag{8.32}$$

$$\sum_{k=0}^{m} \alpha_k u_k^{(j)}(b) + \sum_{k=1}^{n-1} \beta_k v_k^{(j)}(b) = b_j, \quad j = 1,\dots,\ell-1,$$

of $m + n$ linear equations. By Lemma 8.29 the homogeneous form of the
system (8.32) has only the trivial solution. Therefore, the inhomogeneous
system (8.32) is uniquely solvable, and the proof is finished. □



In principle, for the actual computation of the interpolating spline, it
is possible to use the linear system (8.32). However, as a consequence of
the global nature of the basis functions (8.24), this system turns out to be
ill-conditioned. Therefore, it is preferable to use the corresponding linear
system derived from another set of basis functions known as basic splines,
or simply B-splines. As opposed to the splines (8.24), the B-splines have
local support, i.e., they differ from zero only within $m + 1$ neighboring
subintervals.

For the sake of simplicity we confine our analysis of B-splines to the case
of an equidistant subdivision of step length $h$. We set

$$B_0(x) := \begin{cases} 1, & |x| \le \frac{1}{2}, \\ 0, & |x| > \frac{1}{2}, \end{cases}$$

and define recursively

$$B_{m+1}(x) := \int_{x-\frac{1}{2}}^{x+\frac{1}{2}} B_m(y)\,dy, \quad x \in \mathbb{R}, \quad m = 0,1,\dots. \tag{8.33}$$

Then, by induction, it can be seen that the $B_m$ are $(m-1)$-times con-
tinuously differentiable and nonnegative, vanish outside the interval
$[-m/2 - 1/2,\ m/2 + 1/2]$, and reduce to a polynomial of degree $m$ in each
of the intervals $[i, i+1]$ for $m$ odd and $[i - 1/2, i + 1/2]$ for $m$ even for $i$ an
integer; i.e., the $B_m$ are splines of order $m$ on an integer grid if $m$ is odd
and on a half integer grid if $m$ is even.
Elementary integrations show that

$$B_1(x) = \begin{cases} 1 - |x|, & |x| \le 1, \\ 0, & |x| \ge 1, \end{cases} \tag{8.34}$$

$$B_2(x) = \frac{1}{2} \begin{cases} 2 - (|x| - \frac{1}{2})^2 - (|x| + \frac{1}{2})^2, & |x| \le \frac{1}{2}, \\ (|x| - \frac{3}{2})^2, & \frac{1}{2} \le |x| \le \frac{3}{2}, \\ 0, & |x| \ge \frac{3}{2}, \end{cases} \tag{8.35}$$

$$B_3(x) = \frac{1}{6} \begin{cases} (2 - |x|)^3 - 4(1 - |x|)^3, & |x| \le 1, \\ (2 - |x|)^3, & 1 \le |x| \le 2, \\ 0, & |x| \ge 2. \end{cases} \tag{8.36}$$

Graphs of these B-splines are given in Figure 8.1. 
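
As a plausibility check, the closed forms for $B_1$, $B_2$, $B_3$ can be compared
with the integral recursion (8.33). The Python sketch below does this with a
simple midpoint rule; the function names and the number of quadrature steps
are assumptions chosen for illustration, and the quadrature is only
approximate.

```python
def b1(x):
    """Linear B-spline (hat function)."""
    return max(0.0, 1.0 - abs(x))

def b2(x):
    """Quadratic B-spline."""
    ax = abs(x)
    if ax <= 0.5:
        return 0.5 * (2.0 - (ax - 0.5) ** 2 - (ax + 0.5) ** 2)
    if ax <= 1.5:
        return 0.5 * (ax - 1.5) ** 2
    return 0.0

def b3(x):
    """Cubic B-spline."""
    ax = abs(x)
    if ax <= 1.0:
        return ((2.0 - ax) ** 3 - 4.0 * (1.0 - ax) ** 3) / 6.0
    if ax <= 2.0:
        return (2.0 - ax) ** 3 / 6.0
    return 0.0

def next_bspline(b, x, steps=2000):
    """Midpoint-rule approximation of the integral of b over [x - 1/2, x + 1/2]."""
    h = 1.0 / steps
    return h * sum(b(x - 0.5 + (i + 0.5) * h) for i in range(steps))

points = [-1.7, -0.5, 0.0, 0.3, 1.2]
err2 = max(abs(next_bspline(b1, x) - b2(x)) for x in points)  # B_2 from B_1
err3 = max(abs(next_bspline(b2, x) - b3(x)) for x in points)  # B_3 from B_2
```

Both differences are at the level of the quadrature error, and the values
$B_3(0) = 2/3$ and $B_3(\pm 1) = 1/6$ used later for cubic spline interpolation
come out of `b3` directly.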









Theorem 8.31 For $m \in \mathbb{N} \cup \{0\}$ the B-splines

$$B_m(\cdot - k), \quad k = 0,\dots,m, \tag{8.37}$$

are linearly independent on the interval $I_m := \left[\frac{m-1}{2}, \frac{m+1}{2}\right]$.

Proof. This is trivial for $m = 0$, and we assume that it has been proven for
degree $m - 1$ for some $m \ge 1$. Let

$$\sum_{k=0}^{m} \alpha_k B_m(x - k) = 0, \quad x \in I_m. \tag{8.38}$$

Then, with the aid of (8.33), differentiating (8.38) yields

$$\sum_{k=0}^{m} \alpha_k \left[ B_{m-1}\left(x - k + \tfrac{1}{2}\right) - B_{m-1}\left(x - k - \tfrac{1}{2}\right) \right] = 0, \quad x \in I_m.$$

Observing that the supports of $B_{m-1}(\cdot + \frac{1}{2})$ and $B_{m-1}(\cdot - m - \frac{1}{2})$ do not
intersect with $I_m$, we can rewrite this as

$$\sum_{k=1}^{m} [\alpha_k - \alpha_{k-1}]\, B_{m-1}\left(x - k + \tfrac{1}{2}\right) = 0, \quad x \in I_m,$$

whence $\alpha_k = \alpha_{k-1}$ for $k = 1,\dots,m$ follows by the induction assumption;
i.e., $\alpha_k = \alpha$ for $k = 0,\dots,m$. Now (8.38) reads

$$\alpha \sum_{k=0}^{m} B_m(x - k) = 0, \quad x \in I_m,$$

and integrating this equation over the interval $I_m$ leads to

$$\alpha \int_{-\frac{m}{2}-\frac{1}{2}}^{\frac{m}{2}+\frac{1}{2}} B_m(x)\,dx = 0.$$

This finally implies $\alpha = 0$, since the $B_m$ are nonnegative, and the proof is
finished. □



Corollary 8.32 Let $x_k = a + hk$, $k = 0,\dots,n$, be an equidistant subdi-
vision of the interval $[a,b]$ of step size $h = (b - a)/n$ with $n \ge 2$, and let
$m = 2\ell - 1$ with $\ell \in \mathbb{N}$. Then the B-splines

$$B_{m,k}(x) := B_m\left(\frac{x - a - hk}{h}\right), \quad x \in [a,b], \tag{8.39}$$

for $k = -\ell + 1,\dots,n + \ell - 1$ form a basis for $S_m^n$.

Proof. The $n + m$ splines (8.39) belong to $S_m^n$, and by the preceding The-
orem 8.31 they can be shown to be linearly independent on $[a,b]$. Hence,
the statement follows from Theorem 8.27. □



The use of the B-splines as a basis opens up another possibility for the
computation of an interpolating spline. We only consider the case $m = 3$,
i.e., cubic splines. From (8.36) we note that

$$B_3(0) = \tfrac{2}{3}, \quad B_3(\pm 1) = \tfrac{1}{6}, \quad B_3'(0) = 0, \quad B_3'(\pm 1) = \mp\tfrac{1}{2}.$$

Therefore, the cubic spline

$$s(x) = \sum_{k=-1}^{n+1} \alpha_k B_3\left(\frac{x - x_k}{h}\right), \quad x \in [a,b], \tag{8.40}$$

satisfies the interpolation conditions (8.29) and the boundary conditions
(8.30) if and only if the $n + 3$ coefficients $\alpha_{-1},\dots,\alpha_{n+1}$ satisfy the system

$$-\tfrac{1}{2}\alpha_{-1} + \tfrac{1}{2}\alpha_1 = h a_1,$$

$$\tfrac{1}{6}\alpha_{j-1} + \tfrac{2}{3}\alpha_j + \tfrac{1}{6}\alpha_{j+1} = y_j, \quad j = 0,\dots,n, \tag{8.41}$$

$$-\tfrac{1}{2}\alpha_{n-1} + \tfrac{1}{2}\alpha_{n+1} = h b_1,$$

of $n + 3$ linear equations. Since the matrix of this system is irreducible and
weakly row-diagonally dominant, the solution can be obtained by Jacobi
iteration (see Theorem 4.7).
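
The B-spline system for the cubic case can be set up and solved in a few
lines. The Python sketch below is an illustration, not the method recommended
in the text: instead of Jacobi iteration it eliminates the two outer
coefficients using the first and last equation and then applies direct
tridiagonal elimination (Thomas algorithm) to the remaining $n + 1$ unknowns.
The function names and the test function $x^3$ are assumptions of this
example.

```python
def b3(x):
    """Cubic B-spline on the integer grid."""
    ax = abs(x)
    if ax <= 1.0:
        return ((2.0 - ax) ** 3 - 4.0 * (1.0 - ax) ** 3) / 6.0
    if ax <= 2.0:
        return (2.0 - ax) ** 3 / 6.0
    return 0.0

def cubic_spline_coeffs(y, h, da, db):
    """Return [alpha_{-1}, ..., alpha_{n+1}] for values y_0..y_n, s'(a)=da, s'(b)=db."""
    n = len(y) - 1
    lower = [1 / 6] * (n + 1)
    diag = [2 / 3] * (n + 1)
    upper = [1 / 6] * (n + 1)
    rhs = [float(v) for v in y]
    # Substitute alpha_{-1} = alpha_1 - 2*h*da into the equation for j = 0
    upper[0] = 1 / 3
    rhs[0] += h * da / 3
    # Substitute alpha_{n+1} = alpha_{n-1} + 2*h*db into the equation for j = n
    lower[n] = 1 / 3
    rhs[n] -= h * db / 3
    # Forward elimination of the tridiagonal system
    for i in range(1, n + 1):
        w = lower[i] / diag[i - 1]
        diag[i] -= w * upper[i - 1]
        rhs[i] -= w * rhs[i - 1]
    # Back substitution
    alpha = [0.0] * (n + 1)
    alpha[n] = rhs[n] / diag[n]
    for i in range(n - 1, -1, -1):
        alpha[i] = (rhs[i] - upper[i] * alpha[i + 1]) / diag[i]
    return [alpha[1] - 2 * h * da] + alpha + [alpha[n - 1] + 2 * h * db]

def cubic_spline_eval(coeffs, a, h, x):
    """Evaluate the B-spline sum; coeffs[i] belongs to the node x_{i-1} = a + (i-1)*h."""
    return sum(c * b3((x - a - (i - 1) * h) / h) for i, c in enumerate(coeffs))

# Example: interpolate f(x) = x^3 on [0, 2] with n = 4, f'(0) = 0, f'(2) = 12
a, b, n = 0.0, 2.0, 4
h = (b - a) / n
y = [(a + k * h) ** 3 for k in range(n + 1)]
coeffs = cubic_spline_coeffs(y, h, 0.0, 12.0)
```

Since $x^3$ is itself a cubic spline satisfying the interpolation and boundary
conditions, uniqueness (Theorem 8.30) implies that the computed spline should
reproduce $x^3$ on $[0,2]$ up to rounding errors.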

We conclude this section with an analysis of the interpolation error for 
cubic splines and note that the results can be extended to arbitrary odd 
degree. We begin with a convergence result for arbitrary subdivisions under 
a weak regularity assumption on the interpolated function. 



Theorem 8.33 Let $f : [a,b] \to \mathbb{R}$ be twice continuously differentiable and
let $s \in S_3^n$ be the uniquely determined cubic spline satisfying the interpola-
tion and boundary conditions of Lemma 8.28. Then

$$\|f - s\|_\infty \le \frac{h^{3/2}}{2}\,\|f''\|_2 \quad\text{and}\quad \|f' - s'\|_\infty \le h^{1/2}\,\|f''\|_2,$$

where $h := \max_{j=1,\dots,n} |x_j - x_{j-1}|$.

Proof. The error function $r := f - s$ has $n + 1$ zeros $x_0,\dots,x_n$. Hence, the
distance between two consecutive zeros of $r$ is less than or equal to $h$. By
Rolle's theorem, the derivative $r'$ has $n$ zeros with distance less than or
equal to $2h$. Choose $z \in [a,b]$ such that $|r'(z)| = \|r'\|_\infty$. Then the closest
zero $\zeta$ of $r'$ has distance $|\zeta - z| \le h$, and by the Cauchy-Schwarz inequality
we can estimate

$$\|r'\|_\infty^2 = \left| \int_\zeta^z r''(y)\,dy \right|^2 \le h \left| \int_\zeta^z [r''(y)]^2\,dy \right| \le h\,\|r''\|_2^2.$$

From this, using Lemma 8.28 we obtain $\|r'\|_\infty \le \sqrt{h}\,\|f''\|_2$.
Choose $x \in [a,b]$ such that $|r(x)| = \|r\|_\infty$. Then the closest zero $\xi$ of $r$
has distance $|\xi - x| \le h/2$, and we can estimate

$$\|r\|_\infty = \left| \int_\xi^x r'(y)\,dy \right| \le \frac{h}{2}\,\|r'\|_\infty \le \frac{h\sqrt{h}}{2}\,\|f''\|_2,$$

which concludes the proof. □



If we assume more regularity on $f$, we can improve on the order of
convergence. For this we need to derive an estimate on the second derivative
of the interpolating spline. From (8.36) it follows that

$$B_3''(0) = -2 \quad\text{and}\quad B_3''(\pm 1) = 1.$$

Hence, the cubic spline (8.40) has second derivatives given by the difference
formula

$$s''(x_j) = \frac{1}{h^2}\,[\alpha_{j-1} - 2\alpha_j + \alpha_{j+1}], \quad j = 0,\dots,n. \tag{8.42}$$

From this we deduce that

$$h^2[s''(x_{j-1}) + 4s''(x_j) + s''(x_{j+1})] = [\alpha_{j-2} + 4\alpha_{j-1} + \alpha_j] - 2[\alpha_{j-1} + 4\alpha_j + \alpha_{j+1}] + [\alpha_j + 4\alpha_{j+1} + \alpha_{j+2}]$$

for $j = 1,\dots,n-1$,

$$h^2[4s''(x_0) + 2s''(x_1)] = 6[\alpha_{-1} - \alpha_1] - 2[\alpha_{-1} + 4\alpha_0 + \alpha_1] + 2[\alpha_0 + 4\alpha_1 + \alpha_2],$$

and

$$h^2[2s''(x_{n-1}) + 4s''(x_n)] = 2[\alpha_{n-2} + 4\alpha_{n-1} + \alpha_n] - 2[\alpha_{n-1} + 4\alpha_n + \alpha_{n+1}] - 6[\alpha_{n-1} - \alpha_{n+1}].$$

From this and the linear system (8.41), for the special case of the interpo-
lation conditions (8.26) and the boundary conditions (8.27), it follows that
the $n + 1$ values of $s''$ at the grid points satisfy the system

$$4s''(x_0) + 2s''(x_1) = F_0,$$

$$s''(x_{j-1}) + 4s''(x_j) + s''(x_{j+1}) = F_j, \quad j = 1,\dots,n-1, \tag{8.43}$$

$$2s''(x_{n-1}) + 4s''(x_n) = F_n,$$

of $n + 1$ linear equations with right-hand sides

$$F_0 := \frac{12}{h^2}\,[f(x_1) - f(x_0) - h f'(x_0)],$$

$$F_j := \frac{6}{h^2}\,[f(x_{j-1}) - 2f(x_j) + f(x_{j+1})], \quad j = 1,\dots,n-1,$$

$$F_n := \frac{12}{h^2}\,[f(x_{n-1}) - f(x_n) + h f'(x_n)].$$

From the system (8.43) we can conclude that

$$4|s''(x_j)| \le |F_j| + 2 \max_{k=0,\dots,n} |s''(x_k)|, \quad j = 0,\dots,n,$$

and therefore

$$\max_{j=0,\dots,n} |s''(x_j)| \le \frac{1}{2} \max_{j=0,\dots,n} |F_j|. \tag{8.44}$$

If $f$ is twice continuously differentiable, by Taylor's formula we can estimate

$$\max\{|F_0|, |F_n|\} \le 6\,\|f''\|_\infty.$$

From Example 8.12, applied to the remainder in the linear interpolation of
$f(x_j)$ from $f(x_{j-1})$ and $f(x_{j+1})$, we obtain

$$|F_j| = \frac{6}{h^2}\,|f(x_{j-1}) - 2f(x_j) + f(x_{j+1})| \le 6\,\|f''\|_\infty, \quad j = 1,\dots,n-1.$$

Hence, since $s''$ is piecewise linear, from (8.44) it follows that

$$\|s''\|_\infty \le 3\,\|f''\|_\infty. \tag{8.45}$$

Theorem 8.34 Let $f : [a,b] \to \mathbb{R}$ be four-times continuously differen-
tiable and let $s \in S_3^n$ be the uniquely determined cubic spline satisfying the
interpolation and boundary conditions of Lemma 8.28 for an equidistant
subdivision with step width $h$. Then

$$\|f - s\|_\infty \le \frac{h^4}{16}\,\|f^{(4)}\|_\infty.$$

Proof. By $L_1 : C[a,b] \to S_1^n$ we denote the interpolation operator mapping
$g \in C[a,b]$ onto its uniquely determined piecewise linear interpolation.
From Example 8.12 we obtain that

$$\|r\|_\infty = \|r - L_1 r\|_\infty \le \frac{h^2}{8}\,\|r''\|_\infty,$$

since trivially $L_1 r = 0$.
By integration, we choose a function $w$ such that $w'' = L_1 f''$. Applying
the estimate (8.45) for the cubic spline $s - w$ and using the estimate (8.9)
for the piecewise linear interpolation of $f''$ we obtain

$$\|f'' - s''\|_\infty \le \|f'' - L_1 f''\|_\infty + \|L_1 f'' - s''\|_\infty \le 4\,\|f'' - L_1 f''\|_\infty \le \frac{h^2}{2}\,\|f^{(4)}\|_\infty.$$

By piecing together the last two inequalities we obtain the assertion of the
theorem. □



8.4 Bezier Polynomials 

In this section we want to introduce some of the basic ideas of computer-
aided geometric design. We will confine our presentation to planar (and
spatial) curves, i.e., to subsets $\Gamma \subset \mathbb{R}^m$, $m = 2,3$, that can be described
by a continuous mapping $f : D \to \mathbb{R}^m$ of an interval $D \subset \mathbb{R}$ into $\mathbb{R}^m$.
For the purposes of computer-aided geometric design it is essential that
the geometric objects can be visualized and manipulated on the computer
very effectively and rapidly. This, in particular, makes it essential that
the parameters entering the representation of the curves have a geometric
meaning. The latter property, for example, is not fulfilled by polynomial
curves represented through the classical monomial basis.

Definition 8.35 For $n \in \mathbb{N} \cup \{0\}$, we denote by $P_n^m$ the linear space of
polynomials of the form

$$p(x) = \sum_{k=0}^{n} a_k x^k, \quad x \in \mathbb{R},$$

where $a_0,\dots,a_n \in \mathbb{R}^m$. A polynomial $p \in P_n^m$ is said to be of degree $n$ if
$a_n \ne 0$.

We proceed by introducing a basis for polynomials on an interval $[a,b]$
in $\mathbb{R}$ with $a < b$ that is better suited for the purposes of computer-aided
design than the monomial basis. For this we make use of the fact that by
the affine linear transformation

$$x \mapsto t(x) := \frac{x - a}{b - a} \tag{8.46}$$

the interval $[a,b]$ can be mapped on the interval $[0,1]$. By the binomial
formula we have that

$$1 = [t + (1 - t)]^n = \sum_{k=0}^{n} \binom{n}{k} t^k (1 - t)^{n-k}.$$

The terms in this partition of unity are called Bernstein polynomials for
the interval $[0,1]$. From these, the Bernstein polynomials for the interval
$[a,b]$ are obtained via the transformation (8.46).
Definition 8.36 The Bernstein polynomials of degree $n$ for the interval
$[0,1]$ are given by

$$B_k^n(t) := \binom{n}{k} t^k (1 - t)^{n-k}, \quad k = 0,\dots,n. \tag{8.47}$$

Correspondingly, the polynomials

$$B_k^n(x; a, b) := B_k^n\left(\frac{x - a}{b - a}\right) = \frac{1}{(b - a)^n} \binom{n}{k} (x - a)^k (b - x)^{n-k}, \quad k = 0,\dots,n,$$

are called Bernstein polynomials of degree $n$ for the interval $[a,b]$.

Some basic properties of Bernstein polynomials are described in the fol- 
lowing theorem. 

Theorem 8.37 The Bernstein polynomials are nonnegative on $[0,1]$ and
provide a partition of unity; i.e.,

$$B_k^n(t) \ge 0, \quad t \in [0,1], \tag{8.48}$$

and

$$\sum_{k=0}^{n} B_k^n(t) = 1, \quad t \in \mathbb{R}. \tag{8.49}$$

They satisfy the relations

$$B_k^n(t) = B_{n-k}^n(1 - t), \quad k = 0,\dots,n, \tag{8.50}$$

and

$$B_0^n(t) = (1 - t)\,B_0^{n-1}(t), \quad B_n^n(t) = t\,B_{n-1}^{n-1}(t) \tag{8.51}$$

for all $t \in \mathbb{R}$ and $n \in \mathbb{N}$. The point $t = 0$ is a zero of $B_k^n$ of order $k$, and
$t = 1$ is a zero of order $n - k$. Each of the polynomials $B_k^n$ assumes its
maximum value only at $t = k/n$. They satisfy the recursion relation

$$B_k^n(t) = t\,B_{k-1}^{n-1}(t) + (1 - t)\,B_k^{n-1}(t), \quad t \in \mathbb{R}, \tag{8.52}$$

for $n \in \mathbb{N}$ and $k = 1,\dots,n-1$. The polynomials $B_0^n,\dots,B_n^n$ form a basis
of $P_n$.

Proof. The first five properties are obvious. The statement on the maximum
of $B_k^n$ is a consequence of

$$\frac{d}{dt} B_k^n(t) = \binom{n}{k} t^{k-1} (1 - t)^{n-k-1} (k - nt), \quad k = 0,\dots,n.$$

The recursion formula (8.52) follows from the definition (8.47) and the
recursion formula

$$\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}$$

for the binomial coefficients. In order to show that the $n + 1$ polynomials
$B_0^n,\dots,B_n^n$ of degree $n$ provide a basis of $P_n$, we prove that they are linearly
independent. Let

$$\sum_{k=0}^{n} b_k B_k^n(t) = 0, \quad t \in [0,1].$$

Then

$$\sum_{k=0}^{n} b_k \frac{d^j}{dt^j} B_k^n(t) = 0, \quad t \in [0,1],$$

and therefore

$$\sum_{k=0}^{j} b_k \frac{d^j}{dt^j} B_k^n(0) = 0, \quad j = 0,\dots,n,$$

since $t = 0$ is a zero of $B_k^n$ of order $k$. From this, by induction we find that
$b_0 = \cdots = b_n = 0$. □
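
Several of the properties in Theorem 8.37 are easy to confirm numerically.
The following Python sketch evaluates the Bernstein polynomials from their
definition and collects the residuals of the partition of unity, of the
symmetry relation, and of the recursion relation; the helper names and the
chosen degree and evaluation points are illustrative assumptions.

```python
from math import comb

def bernstein(n, k, t):
    """B_k^n(t) = C(n, k) t^k (1 - t)^(n - k) on the interval [0, 1]."""
    return comb(n, k) * t ** k * (1 - t) ** (n - k)

def bernstein_ab(n, k, x, a, b):
    """Bernstein polynomial for [a, b], obtained via t = (x - a) / (b - a)."""
    return bernstein(n, k, (x - a) / (b - a))

n = 5
ts = [0.0, 0.3, 1.0, 1.7]  # the identities hold for all real t
# Partition of unity: sum_k B_k^n(t) = 1
unity = max(abs(sum(bernstein(n, k, t) for k in range(n + 1)) - 1.0) for t in ts)
# Symmetry: B_k^n(t) = B_{n-k}^n(1 - t)
symmetry = max(abs(bernstein(n, k, t) - bernstein(n, n - k, 1 - t))
               for t in ts for k in range(n + 1))
# Recursion: B_k^n(t) = t B_{k-1}^{n-1}(t) + (1 - t) B_k^{n-1}(t), k = 1..n-1
recursion = max(abs(bernstein(n, k, t)
                    - t * bernstein(n - 1, k - 1, t)
                    - (1 - t) * bernstein(n - 1, k, t))
                for t in ts for k in range(1, n))
```

All three residuals are of the order of rounding errors. The function
`bernstein_ab` simply composes with the affine map from $[a,b]$ to $[0,1]$.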

Definition 8.38 The coefficients $b_0,\dots,b_n \in \mathbb{R}^m$ in the representation of
a polynomial $p \in P_n^m$ through the Bernstein basis

$$p(x) = \sum_{k=0}^{n} b_k B_k^n(x; a, b), \quad x \in [a,b], \tag{8.53}$$

are called control points, or Bezier points, of $p$. The polygon determined by
them is called the Bezier polygon.

We now want to indicate that the graph of the polynomial $p$ is closely
related to the form of the Bezier polygon, and for this reason the graph
of $p$ is often referred to as Bezier curve. We first note that $p(a) = b_0$
and $p(b) = b_n$; i.e., both endpoints of the Bezier curve and the Bezier
polygon coincide. Furthermore, from (8.48) and (8.49) it follows that the
Bezier curve is contained in the convex hull $\operatorname{con}\{b_0,\dots,b_n\}$ of the Bezier
points. The convex hull

$$\operatorname{con}\{b_0,\dots,b_n\} := \left\{ \sum_{k=0}^{n} \alpha_k b_k : \alpha_k \ge 0,\ \sum_{k=0}^{n} \alpha_k = 1 \right\}$$

is the smallest convex set containing the points $b_0,\dots,b_n$ (see Problem
8.19).

For computing the derivatives of a Bezier curve we first note that

$$\frac{d}{dt} B_k^n(t) = \binom{n}{k} \left[ k t^{k-1} (1 - t)^{n-k} - (n - k) t^k (1 - t)^{n-k-1} \right]$$

implies that

$$\frac{d}{dt} B_k^n = \begin{cases} -n B_0^{n-1}, & k = 0, \\ n (B_{k-1}^{n-1} - B_k^{n-1}), & k = 1,\dots,n-1, \\ n B_{n-1}^{n-1}, & k = n. \end{cases} \tag{8.54}$$

With this identity we are ready to establish the following theorem. 

Theorem 8.39 Let

$$p(t) = \sum_{k=0}^{n} b_k B_k^n(t), \quad t \in [0,1],$$

be a Bezier polynomial on $[0,1]$. Then

$$p^{(j)}(t) = \frac{n!}{(n-j)!} \sum_{k=0}^{n-j} \Delta^j b_k\, B_k^{n-j}(t), \quad j = 1,\dots,n,$$

with the forward differences $\Delta^j b_k$ recursively defined by

$$\Delta^0 b_k := b_k, \quad \Delta^j b_k := \Delta^{j-1} b_{k+1} - \Delta^{j-1} b_k, \quad j = 1,\dots,n.$$

Proof. Obviously, the statement is true for $j = 0$. We assume that it has
been proven for some $0 \le j < n$. Then with the aid of (8.54) we obtain

$$p^{(j+1)}(t) = \frac{n!}{(n-j)!} \sum_{k=0}^{n-j} \Delta^j b_k \frac{d}{dt} B_k^{n-j}(t) = \frac{n!}{(n-j)!}\,(n-j) \left\{ \sum_{k=1}^{n-j} \Delta^j b_k B_{k-1}^{n-j-1}(t) - \sum_{k=0}^{n-j-1} \Delta^j b_k B_k^{n-j-1}(t) \right\}$$

$$= \frac{n!}{[n-(j+1)]!} \sum_{k=0}^{n-j-1} (\Delta^j b_{k+1} - \Delta^j b_k)\, B_k^{n-j-1}(t) = \frac{n!}{[n-(j+1)]!} \sum_{k=0}^{n-(j+1)} \Delta^{j+1} b_k\, B_k^{n-(j+1)}(t),$$

which establishes the assertion for $j + 1$. □

Corollary 8.40 The polynomial from Theorem 8.39 has the derivatives

$$p^{(j)}(0) = \frac{n!}{(n-j)!}\,\Delta^j b_0, \quad p^{(j)}(1) = \frac{n!}{(n-j)!}\,\Delta^j b_{n-j}$$

at the two endpoints.

From Corollary 8.40 we note that $p^{(j)}(0)$ depends only on $b_0,\dots,b_j$ and
that $p^{(j)}(1)$ depends only on $b_{n-j},\dots,b_n$. In particular, we have that

$$p'(0) = n(b_1 - b_0), \quad p'(1) = n(b_n - b_{n-1}); \tag{8.55}$$

i.e., at the two endpoints the Bezier curve has the same tangent lines as
the Bezier polygon. Through the affine transformation (8.46) these results
on the derivatives carry over to the general interval $[a,b]$.
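
The derivative formula in terms of forward differences can be illustrated
with a small Python sketch. For simplicity it assumes scalar control points
(the case of points in $\mathbb{R}^m$ works componentwise); the function
names and the sample control points are hypothetical.

```python
from math import comb, factorial

def forward_diff(b, j):
    """Return the list of j-th forward differences of the control points b."""
    d = list(b)
    for _ in range(j):
        d = [d[k + 1] - d[k] for k in range(len(d) - 1)]
    return d

def bezier_derivative(b, j, t):
    """j-th derivative (0 <= j <= n) of the Bezier polynomial with control points b."""
    n = len(b) - 1
    d = forward_diff(b, j)
    scale = factorial(n) // factorial(n - j)  # n! / (n - j)!
    return scale * sum(d[k] * comb(n - j, k) * t ** k * (1 - t) ** (n - j - k)
                       for k in range(n - j + 1))

b = [1.0, 3.0, -2.0, 0.5]                  # sample control points, n = 3
slope_start = bezier_derivative(b, 1, 0.0)  # n * (b_1 - b_0)
slope_end = bezier_derivative(b, 1, 1.0)    # n * (b_n - b_{n-1})
# Central difference quotient of p itself as an independent comparison
eps = 1e-6
fd = (bezier_derivative(b, 0, 0.4 + eps) - bezier_derivative(b, 0, 0.4 - eps)) / (2 * eps)
```

For these control points the endpoint slopes are $3(3 - 1) = 6$ and
$3(0.5 + 2) = 7.5$, in agreement with the endpoint formulas for $p'(0)$ and
$p'(1)$.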




FIGURE 8.2. Bezier polynomials of degree two 

Figure 8.2 illustrates by two Bezier polynomials of degree two in $\mathbb{R}^2$ how
the shape of the curve is influenced by the location of the control points $b_1$.
From (8.55) we also observe how to patch two Bezier polynomials of degree
two together smoothly such that the tangent lines at the joints coincide,
i.e., such that the two polynomials match up to a Bezier spline of degree
two. The Bezier polynomials have the same tangent lines at the joints if
the Bezier polygons do. This is illustrated by Figure 8.3.




We will conclude this section by describing the de Casteljau algorithm
as a very stable and fast method for computing the function values $p(t)$ of
a Bezier polynomial. Given a Bezier polynomial

$$p(t) = \sum_{k=0}^{n} b_k B_k^n(t), \quad t \in [0,1],$$

we define the subpolynomials $b_i^k \in P_k^m$ by

$$b_i^k(t) := \sum_{j=0}^{k} b_{i+j} B_j^k(t) \tag{8.56}$$

for $i = 0,\dots,n-k$ and $k = 0,\dots,n$. For polynomials on $[a,b]$ we have
an analogous definition for the subpolynomials. The subpolynomial $b_i^k$ is
a polynomial of degree $k$ and has the $k + 1$ control points $b_i,\dots,b_{i+k}$.
In particular, we have that $b_0^n = p$. Analogous to the Neville scheme of
Theorem 8.9 we have the following recursion formula, which is the basis of
the de Casteljau algorithm.
Theorem 8.41 The subpolynomials $b_i^k$ of a Bezier polynomial $p$ of degree
$n$ satisfy the recursion formulae

$$b_i^k(t) = (1 - t)\,b_i^{k-1}(t) + t\,b_{i+1}^{k-1}(t) \tag{8.57}$$

for $i = 0,\dots,n-k$ and $k = 1,\dots,n$.

Proof. We insert the recursion formulae (8.51) and (8.52) for the Bernstein
polynomials into the definition (8.56) for the subpolynomials and obtain

$$b_i^k(t) = b_i B_0^k(t) + \sum_{j=1}^{k-1} b_{i+j} B_j^k(t) + b_{i+k} B_k^k(t)$$

$$= \sum_{j=0}^{k-1} b_{i+j} (1 - t) B_j^{k-1}(t) + \sum_{j=1}^{k} b_{i+j}\, t\, B_{j-1}^{k-1}(t)$$

$$= (1 - t)\,b_i^{k-1}(t) + t\,b_{i+1}^{k-1}(t),$$

which establishes (8.57). □

Since $b_0^n(t) = p(t)$, starting the recursion with $b_k^0(t) = b_k$, from (8.57) we
can compute $p(t)$ by successive convex combinations of the Bezier points
$b_0,\dots,b_n$, which clearly is a numerically stable procedure. Since (8.57) is
similar in structure to the divided differences in Definition 8.4, the compu-
tations can be arranged in a tableau analogous to the one for the divided
differences.
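
A compact way to see the de Casteljau recursion at work is the following
Python sketch. It treats control points as tuples of coordinates and compares
the result with a direct evaluation through the Bernstein basis; the function
names and the planar sample points are assumptions made for illustration.

```python
from math import comb

def de_casteljau(points, t):
    """Evaluate the Bezier curve by repeated convex combinations of neighbors."""
    pts = [tuple(p) for p in points]
    while len(pts) > 1:
        # One column of the tableau: b_i^k = (1 - t) b_i^{k-1} + t b_{i+1}^{k-1}
        pts = [tuple((1 - t) * u + t * v for u, v in zip(p, q))
               for p, q in zip(pts, pts[1:])]
    return pts[0]

def bezier_direct(points, t):
    """Evaluate sum_k b_k B_k^n(t) directly from the Bernstein polynomials."""
    n = len(points) - 1
    dim = len(points[0])
    return tuple(sum(comb(n, k) * t ** k * (1 - t) ** (n - k) * points[k][d]
                     for k in range(n + 1))
                 for d in range(dim))

control = [(0.0, 0.0), (1.0, 2.0), (3.0, 3.0), (4.0, 0.0)]  # cubic curve in the plane
p_casteljau = de_casteljau(control, 0.3)
p_direct = bezier_direct(control, 0.3)
```

For $0 \le t \le 1$ each step forms convex combinations only, so no
cancellation occurs, which reflects the stability remark above.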

From the coefficients of the de Casteljau tableau we can construct two
Bezier polynomials on the subintervals $[0,t]$ and $[t,1]$ that coincide with
the original Bezier polynomial on the full interval $[0,1]$.
Theorem 8.42 The Bezier polynomials

$$p_1(x) := \sum_{k=0}^{n} b_0^k(t)\, B_k^n(x; 0, t) \quad\text{and}\quad p_2(x) := \sum_{k=0}^{n} b_k^{n-k}(t)\, B_k^n(x; t, 1)$$

with the coefficients $b_0^k$ and $b_k^{n-k}$ for $k = 0,\dots,n$ defined by the recursion
(8.57) satisfy

$$p(x) = p_1(x) = p_2(x), \quad x \in \mathbb{R},$$

for arbitrary $0 < t < 1$.

Proof. Inserting the equivalent definition (8.56) of the subpolynomials and
reordering the summation, we find that

$$p_1(x) = \sum_{k=0}^{n} \sum_{j=0}^{k} b_j B_j^k(t)\, B_k^n(x; 0, t) = \sum_{j=0}^{n} b_j \sum_{k=j}^{n} B_j^k(t)\, B_k^n(x; 0, t).$$

Hence the proof will be concluded by showing that

$$\sum_{k=j}^{n} B_j^k(t)\, B_k^n(x; 0, t) = B_j^n(x), \quad x \in \mathbb{R}. \tag{8.58}$$

To establish this identity we make use of Definition 8.36 and obtain with
the aid of the binomial formula that

$$\sum_{k=j}^{n} B_j^k(t)\, B_k^n(x; 0, t) = \sum_{k=j}^{n} \binom{k}{j} \binom{n}{k} t^j (1 - t)^{k-j}\, t^{-n} x^k (t - x)^{n-k}$$

$$= \binom{n}{j} t^{j-n} x^j \sum_{k=0}^{n-j} \binom{n-j}{k} (1 - t)^k x^k (t - x)^{n-j-k} = \binom{n}{j} x^j (1 - x)^{n-j}.$$

Hence (8.58) is valid, and consequently $p_1 = p$. The proof of $p_2 = p$ is com-
pletely analogous, and it can also be obtained by a symmetry argument
from $p_1 = p$. □



A natural choice in the subdivision of Theorem 8.42 is to break the
interval in half by taking $t = 1/2$. Successively repeating the subdivision
leads to a sequence of Bezier polygons that converges rapidly enough to
the original Bezier curve to make this subdivision algorithm practical for
an effective visualization of the curve on a computer.
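
One subdivision step can be read off from the de Casteljau tableau: the first
entries of the successive columns give the control points on $[0,t]$ and the
last entries, in reverse order, those on $[t,1]$. The Python sketch below
illustrates this for scalar control points; the names `subdivide` and
`bezier_eval` and the sample data are assumptions of this example.

```python
def bezier_eval(b, s):
    """de Casteljau evaluation for scalar control points b at local parameter s."""
    level = list(b)
    while len(level) > 1:
        level = [(1 - s) * p + s * q for p, q in zip(level, level[1:])]
    return level[0]

def subdivide(b, t):
    """Return control points of the two pieces on [0, t] and [t, 1]."""
    level = list(b)
    left, right = [level[0]], [level[-1]]
    while len(level) > 1:
        level = [(1 - t) * p + t * q for p, q in zip(level, level[1:])]
        left.append(level[0])    # b_0^k(t), k = 1, ..., n
        right.append(level[-1])  # last entry of column k is b_{n-k}^k(t)
    return left, right[::-1]     # reverse so that right[k] = b_k^{n-k}(t)

b = [1.0, 3.0, -2.0, 0.5]
t = 0.5
left, right = subdivide(b, t)
# Both pieces should reproduce p; the local parameters are x/t and (x - t)/(1 - t)
xs = [0.2, 0.5, 0.9]
err_left = max(abs(bezier_eval(left, x / t) - bezier_eval(b, x)) for x in xs)
err_right = max(abs(bezier_eval(right, (x - t) / (1 - t)) - bezier_eval(b, x)) for x in xs)
```

Applying `subdivide` repeatedly to the pieces yields the refined Bezier
polygons mentioned above; since the identity of Theorem 8.42 holds for all
real $x$, the check also works for points outside the respective subinterval.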







Problems 

8.1 Let $u_1, \ldots, u_n \in C[a, b]$ be linearly independent and let $x_1, \ldots, x_n \in [a, b]$
be distinct. For given values $y_1, \ldots, y_n \in \mathbb{R}$ consider the interpolation problem
of finding a function $u \in U_n := \operatorname{span}\{u_1, \ldots, u_n\}$ with the property

$$u(x_j) = y_j, \quad j = 1, \ldots, n.$$

Show that the following three properties are equivalent:

(a) The interpolation problem is uniquely solvable for each given set of values
$y_1, \ldots, y_n \in \mathbb{R}$.

(b) Each function $u \in U_n$ with zeros $u(x_j) = 0$ for $j = 1, \ldots, n$ vanishes identically.

(c) The $n \times n$ matrix with entries $u_k(x_j)$ for $j, k = 1, \ldots, n$ is regular.

8.2 Consider the interpolation of $f(x) := x^4$ by a polynomial $p \in P_3$ with the
four interpolation points $-1, 0, 1, 2$. Discuss the behavior of the error $p - f$ in the
interval $[-1, 2]$.

8.3 Write a computer program for the Neville scheme of Theorem 8.9. 

8.4 Show that the interpolation operator $L_n : C[a, b] \to P_n$ given by (8.7) is a
linear operator. Show that it is a bounded operator if both the domain and range
space are equipped with the maximum norm.

8.5 Let $x_0, \ldots, x_n \in \mathbb{R}$ be $n + 1$ distinct points. Show that the Vandermonde
matrix $V$ with entries $x_j^k$ for $j, k = 0, 1, \ldots, n$ has determinant

$$\det V = \prod_{0 \le j < k \le n} (x_k - x_j).$$

8.6 Verify numerically the findings of Runge described in Example 8.14. 

8.7 Verify the relations (8.12) for the Hermite factors. 

8.8 Prove Theorem 8.19, i.e., the representation of the remainder in Hermite 
interpolation. 

8.9 Given a twice continuously differentiable function $f : [a, b] \to \mathbb{R}$ and three
points $x_0, x_1, x_2 \in [a, b]$ with $x_0 \ne x_2$, show that there exists a unique polynomial
$p \in P_3$ for which

$$p(x_0) = f(x_0), \quad p'(x_1) = f'(x_1), \quad p''(x_1) = f''(x_1), \quad p(x_2) = f(x_2).$$

Find a representation of the polynomial and give a representation of the
remainder analogous to Theorem 8.10. (This is an example of Hermite-Birkhoff
interpolation.)

8.10 Inverse interpolation can be used to solve nonlinear equations $f(x) = 0$
approximately by interchanging the roles of interpolation points and interpolation
values. Find an approximation of the zero $x = 1.5$ for $f(x) = (4x + 1)^3 - 343$ from
the values of $f$ at the four points $x = 0, 1, 2, 3$ by inverse cubic interpolation, i.e.,
by interpolating the inverse of $f$ by a cubic polynomial with interpolation points
$f(0), f(1), f(2), f(3)$ and interpolation values $0, 1, 2, 3$. For the computation use
the Neville scheme. Are you satisfied with the accuracy of the result?

8.11 For the trigonometric interpolation from Theorem 8.24 with $2n + 1$ equidistant
interpolation points show that the Lagrange factors are given by

$$\ell_k(t) = F(t - t_k), \quad k = 0, \ldots, 2n,$$

where

$$F(t) := \frac{1}{2n + 1}\, \frac{\sin\left(n + \frac{1}{2}\right) t}{\sin \frac{t}{2}}$$

for $t \ne 0, \pm 2\pi, \pm 4\pi, \ldots$. Prove that

$$\int_0^{2\pi} \ell_j(t)\, \ell_k(t)\, dt = \frac{2\pi}{2n + 1}\, \delta_{jk}.$$



8.12 For the trigonometric interpolation from Theorem 8.24 with $2n + 1$ equidistant
interpolation points show that

$$\|L_n f - f\|_2 \to 0, \quad n \to \infty,$$

for each continuous $2\pi$-periodic function $f$.

Hint: With the aid of Problem 8.11, show that

$$\|L_n g\|_2 \le \sqrt{2\pi}\, \|g\|_\infty$$

for all $n \in \mathbb{N}$ and all continuous $2\pi$-periodic functions $g$ and use the Weierstrass
approximation theorem for periodic functions.

8.13 For the trigonometric interpolation from Theorem 8.24 with $2n + 1$ equidistant
interpolation points show that

$$\|L_n f - f\|_\infty \to 0, \quad n \to \infty,$$

for each continuously differentiable $2\pi$-periodic function $f$.

Hint: For the functions $f_k(t) := e^{ikt}$ show that

$$\|L_n f_k - f_k\|_\infty \le 2$$

for $n = 1, 2, \ldots$ and $k = 0, \pm 1, \pm 2, \ldots$, and use the fact that the Fourier series
for continuously differentiable functions is uniformly convergent.

8.14 Write a computer program for the fast Fourier transform. 

8.15 Given n distinct points zi , . . . , z n £ [a, b], n distinct points xi, . . . , x n in 
[a, 6], and n values yi , . . . ,t/ n € IR, show that there exists a unique function of 
the form 

k = 1 

with real coefficients $a_1, \ldots, a_n$ such that

$$u(x_j) = y_j, \quad j = 1, \ldots, n.$$

8.16 Verify the relations (8.34)-(8.36) for B-splines.

8.17 Use the fact that the second derivative of a cubic spline is a piecewise linear
function to derive the linear system (8.43) without using the B-spline (8.36).
Hint: On each subinterval integrate the piecewise linear function for $s''$ twice and
eliminate the integration constants through the interpolation conditions. Then
use the continuity of $s'$ to obtain the linear system.

8.18 For the Bernstein polynomials show that

$$\sum_{k=0}^{n} \frac{k}{n}\, B_k^n(t) = t, \quad t \in \mathbb{R},$$

and

$$\sum_{k=0}^{n} \frac{k^2}{n^2}\, B_k^n(t) = \frac{n-1}{n}\, t^2 + \frac{t}{n}, \quad t \in \mathbb{R}.$$

8.19 Show that the convex hull

$$\operatorname{con}\{b_0, \ldots, b_n\} := \left\{ \sum_{k=0}^{n} \alpha_k b_k : \alpha_k \ge 0, \ \sum_{k=0}^{n} \alpha_k = 1 \right\}$$

of $n + 1$ points $b_0, \ldots, b_n \in \mathbb{R}^m$ is convex and that $\operatorname{con}\{b_0, \ldots, b_n\} \subset U$ for each
convex set $U$ with $b_0, \ldots, b_n \in U$.



8.20 Give the Bezier representation of the (cubic) Hermite factors of Theorem 
8.18 for the case of two interpolation points. Draw the graphs of the Hermite 
factors and their Bezier polygons.

Numerical integration formulae, or quadrature formulae, are methods for
the approximate evaluation of definite integrals. They are needed for the 
computation of those integrals for which either the antiderivative of the in- 
tegrand cannot be expressed in terms of elementary functions or for which 
the integrand is available only at discrete points, for example from exper- 
imental data. In addition and even more important, quadrature formulae 
provide a basic and important tool for the numerical solution of differential 
and integral equations, as we shall see in Chapters 10, 11, and 12. 

The evaluation of planar areas bounded by curves is one of the oldest 
problems in science. Attempts to measure the area bounded by circles, 
ellipses, and parabolas were undertaken already by the Babylonians, Egyp- 
tians, and Greeks. However, a systematic analysis only became possible 
after the invention of calculus. Newton interpolated functions at equidis- 
tant points and integrated the interpolating polynomial and thus invented 
what now is known as the Newton-Cotes quadratures. Describing these interpolatory
quadrature formulae will be the subject of Sections 9.1 and 9.2.
Gauss was the first to notice that nonequidistant interpolation points lead, 
in general, to better accuracy for the resulting approximations to the inte- 
grals. In 1814 he presented a paper entitled “Methodus nova integralium 
valores per approximationem inveniendi” introducing quadrature formulae 
with the degree of accuracy considerably improved as compared with the 
Newton-Cotes formulae. These Gaussian quadrature formulae will be the 
subject of Section 9.3. The remaining part of this chapter is based on the 
Euler-Maclaurin expansion, which was found and published independently 
by Euler (1738) and Maclaurin (1737). We shall first employ the Euler-Maclaurin
expansion in our analysis of numerical integration of periodic
functions. We will then use it to develop Romberg integration as a typical 
example for the use of the extrapolation method in order to increase the de- 
gree of accuracy. And finally, for integrands with endpoint singularities we 
will describe quadrature formulae that are based on a mesh that is graded 
towards the endpoints, and we will analyze the error with the help of the 
Euler-Maclaurin expansion. 

For a comprehensive study of numerical integration methods including 
multidimensional integration we refer to [9, 17, 21, 57]. 



9.1 Interpolatory Quadratures 



The most common quadrature formulae approximate the definite integral

$$Q(f) := \int_a^b f(x)\,dx \tag{9.1}$$

of a continuous function $f$ over the interval $[a, b]$ with $a < b$ by a weighted
sum

$$Q_n(f) := \sum_{k=0}^{n} a_k f(x_k) \tag{9.2}$$

with $n + 1$ distinct quadrature points $x_0, \ldots, x_n \in [a, b]$ and quadrature
weights $a_0, \ldots, a_n \in \mathbb{R}$. As one of the main applications of interpolation
as developed in the previous chapter, an important group of quadrature
formulae is obtained by integrating an interpolating polynomial instead of
the integrand $f$, i.e., by approximating

$$\int_a^b f(x)\,dx \approx \int_a^b (L_n f)(x)\,dx,$$

where $L_n : C[a, b] \to P_n$ denotes the polynomial interpolation operator
with interpolation points $x_0, \ldots, x_n$ introduced in Section 8.1 (see (8.7)).
Note that both the integral $Q$ and the quadrature formula $Q_n$ represent
linear operators from $C[a, b]$ into $\mathbb{R}$.



Theorem 9.1 The polynomial interpolatory quadrature of order $n$ defined
by

$$Q_n(f) := \int_a^b (L_n f)(x)\,dx \tag{9.3}$$

is of the form (9.2) with the weights given by

$$a_k = \frac{1}{q_{n+1}'(x_k)} \int_a^b \frac{q_{n+1}(x)}{x - x_k}\,dx, \quad k = 0, \ldots, n, \tag{9.4}$$

where $q_{n+1}(x) := (x - x_0) \cdots (x - x_n)$.

Proof. From (8.2) we obtain

$$\int_a^b (L_n f)(x)\,dx = \sum_{k=0}^{n} f(x_k) \int_a^b \ell_k(x)\,dx,$$

which is of the form (9.2) with

$$a_k = \int_a^b \ell_k(x)\,dx = \int_a^b \prod_{j=0,\, j \ne k}^{n} \frac{x - x_j}{x_k - x_j}\,dx,$$

whence (9.4) follows by rewriting the product. □



The following theorem describes an equivalent definition of polynomial 
interpolatory quadratures. 



Theorem 9.2 Given $n + 1$ distinct quadrature points $x_0, \ldots, x_n \in [a, b]$,
the interpolatory quadrature (9.3) of order $n$ is uniquely determined by its
property of integrating all polynomials $p \in P_n$ exactly, i.e., by the property

$$\sum_{k=0}^{n} a_k p(x_k) = \int_a^b p(x)\,dx \tag{9.5}$$

for all $p \in P_n$.

Proof. From (9.3) and $L_n p = p$ for all $p \in P_n$ it follows that

$$\sum_{k=0}^{n} a_k p(x_k) = \int_a^b (L_n p)(x)\,dx = \int_a^b p(x)\,dx;$$

i.e., the quadrature is exact for all $p \in P_n$. On the other hand, from (9.5)
we obtain

$$\sum_{k=0}^{n} a_k f(x_k) = \sum_{k=0}^{n} a_k (L_n f)(x_k) = \int_a^b (L_n f)(x)\,dx$$

for all $f \in C[a, b]$; i.e., the quadrature is an interpolatory quadrature. □
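
Theorem 9.2 suggests a direct way of computing the weights: impose exactness on the monomials $1, x, \ldots, x^n$ and solve the resulting linear system. The following Python sketch does this in exact rational arithmetic with a small Gauss-Jordan elimination; the function name and the choice of the monomial basis are illustrative assumptions, and for larger $n$ the monomial system becomes ill-conditioned in floating-point arithmetic.

```python
from fractions import Fraction


def interpolatory_weights(points, a, b):
    """Weights a_0..a_n of the interpolatory quadrature on [a, b].

    Solves the exactness conditions sum_k a_k x_k^i = int_a^b x^i dx
    for i = 0..n by Gauss-Jordan elimination in exact arithmetic.
    """
    x = [Fraction(p) for p in points]
    a, b = Fraction(a), Fraction(b)
    size = len(x)
    # augmented matrix: row i holds x_0^i, ..., x_n^i | integral of x^i
    rows = [[xk ** i for xk in x] + [(b ** (i + 1) - a ** (i + 1)) / (i + 1)]
            for i in range(size)]
    for col in range(size):
        # distinct points give a regular matrix, so a pivot always exists
        pivot = next(r for r in range(col, size) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        for r in range(size):
            if r != col:
                factor = rows[r][col]
                rows[r] = [v - factor * p for v, p in zip(rows[r], rows[col])]
    return [rows[k][-1] for k in range(size)]


print(interpolatory_weights([-1, 0, 1], -1, 1))
```

For the three points $-1, 0, 1$ on $[-1, 1]$ the call returns the weights $1/3, 4/3, 1/3$.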

Theorem 9.3 The polynomial interpolatory quadrature of order $n$ with
equidistant quadrature points

$$x_k = a + kh, \quad k = 0, \ldots, n,$$

and step width $h = (b - a)/n$ is called the Newton-Cotes quadrature formula
of order $n$. Its weights are given by

$$a_k = h\, \frac{(-1)^{n-k}}{k!\,(n-k)!} \int_0^n \prod_{j=0,\, j \ne k}^{n} (z - j)\,dz, \quad k = 0, \ldots, n, \tag{9.6}$$

and have the symmetry property $a_k = a_{n-k}$, $k = 0, \ldots, n$.

Proof. The weights are obtained from (9.4) by substituting $x = x_0 + hz$
and observing that

$$q_{n+1}(x) = h^{n+1} \prod_{j=0}^{n} (z - j)$$

and

$$q_{n+1}'(x_k) = (-1)^{n-k}\, k!\,(n-k)!\, h^n.$$

The symmetry $a_k = a_{n-k}$ follows by substituting $z = n - y$. □
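
Formula (9.6) can be evaluated exactly, since the integrand is a polynomial with integer coefficients. The Python sketch below, with a hypothetical function name, expands the product, integrates it termwise over $[0, n]$ in rational arithmetic, and returns the weights with the common factor $h$ omitted.

```python
from fractions import Fraction
from math import factorial


def newton_cotes_weights(n):
    """Weights a_k / h of the Newton-Cotes formula of order n, from (9.6)."""
    weights = []
    for k in range(n + 1):
        # coefficients (lowest degree first) of prod_{j != k} (z - j)
        poly = [Fraction(1)]
        for j in range(n + 1):
            if j != k:
                shifted = [Fraction(0)] + poly                   # z * poly
                scaled = [-j * c for c in poly] + [Fraction(0)]  # -j * poly
                poly = [s + t for s, t in zip(shifted, scaled)]
        # exact integral of the polynomial over [0, n]
        integral = sum(c * Fraction(n) ** (i + 1) / (i + 1)
                       for i, c in enumerate(poly))
        sign = -1 if (n - k) % 2 else 1
        weights.append(sign * integral / (factorial(k) * factorial(n - k)))
    return weights


for n in (1, 2, 3, 4):
    print(n, [str(w) for w in newton_cotes_weights(n)])
```

The printed fractions appear in lowest terms, so for example 24/45 is shown as 8/15.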



These quadrature formulae were first discovered by Newton and also 
carry the name of Cotes because of his systematic account of Newton’s 
integration rules in 1711. The Newton-Cotes quadrature formula of order 
n = 1 is known as the trapezoidal rule. Its weights can be obtained ei- 
ther from evaluating (9.6) or more easily from the exactness conditions of 
Theorem 9.2. For the interval $[-1, 1]$, these conditions are given by

$$a_0 + a_1 = \int_{-1}^{1} dx = 2,$$

$$-a_0 + a_1 = \int_{-1}^{1} x\,dx = 0,$$

and imply that $a_0 = a_1 = 1$. Hence, for a general interval the trapezoidal
rule has the form

$$\int_a^b f(x)\,dx \approx \frac{b - a}{2}\, [f(a) + f(b)] = \frac{h}{2}\, [f(x_0) + f(x_1)].$$

Geometrically speaking, the trapezoidal rule approximates the integral of
$f$ by the integral of the straight line connecting the two points $(a, f(a))$
and $(b, f(b))$. Hence, the approximate value coincides with the area of the
trapezoid with the four corners $(a, 0)$, $(b, 0)$, $(a, f(a))$, and $(b, f(b))$.

The Newton-Cotes quadrature formula of order n = 2 was already known 
to Kepler in 1612 and Cavalieri in 1639 and is called Simpson’s rule , since 
Simpson rediscovered it in 1743. Its weights are obtained from the exactness 
conditions

$$a_0 + a_1 + a_2 = \int_{-1}^{1} dx = 2,$$

$$-a_0 + a_2 = \int_{-1}^{1} x\,dx = 0,$$

$$a_0 + a_2 = \int_{-1}^{1} x^2\,dx = \frac{2}{3},$$

which imply that $a_0 = a_2 = 1/3$ and $a_1 = 4/3$. Hence, for a general interval,
Simpson’s rule is given by

$$\int_a^b f(x)\,dx \approx \frac{b - a}{6} \left[ f(a) + 4 f\!\left(\frac{a + b}{2}\right) + f(b) \right] = \frac{h}{3}\, [f(x_0) + 4 f(x_1) + f(x_2)].$$

Geometrically speaking, Simpson’s rule approximates the integral of $f$ by
the integral of the parabola through the three points $(a, f(a))$,
$\left(\frac{a+b}{2}, f\!\left(\frac{a+b}{2}\right)\right)$, and $(b, f(b))$.

Table 9.1 gives the weights of the first four Newton-Cotes formulae (with
the common factor $h = (b - a)/n$ omitted).

TABLE 9.1. Weights of Newton-Cotes formulae

n    a_k
1    1/2    1/2                              Trapezoidal rule
2    1/3    4/3    1/3                       Simpson’s rule
3    3/8    9/8    9/8    3/8                Newton’s three-eighths rule
4    14/45  64/45  24/45  64/45  14/45       Milne’s rule

For $n = 8$ and for all $n \ge 10$ some of the weights of the Newton-Cotes formulae
are negative (see Problem 9.4). Since this might lead to negative approximations
for integrals with positive integrands, the higher-order Newton-Cotes
rules cannot be recommended for numerical purposes.

We will carry out the error analysis for the Newton-Cotes formulae only
for the two most important cases, $n = 1$ and $n = 2$, i.e., the trapezoidal
rule and Simpson’s rule.

Theorem 9.4 Let $f : [a, b] \to \mathbb{R}$ be twice continuously differentiable.
Then the error for the trapezoidal rule can be represented in the form

$$\int_a^b f(x)\,dx - \frac{b - a}{2}\, [f(a) + f(b)] = -\frac{h^3}{12}\, f''(\xi) \tag{9.7}$$

with some $\xi \in [a, b]$ and $h = b - a$.

Proof. Let $L_1 f$ denote the linear interpolation of $f$ at the interpolation
points $x_0 = a$ and $x_1 = b$. By construction of the trapezoidal rule we have
that the error

$$E_1(f) := \int_a^b f(x)\,dx - \frac{b - a}{2}\, [f(a) + f(b)]$$

is given by

$$E_1(f) = \int_a^b [f(x) - (L_1 f)(x)]\,dx = \int_a^b (x - a)(x - b)\, \frac{f(x) - (L_1 f)(x)}{(x - a)(x - b)}\,dx.$$

Since the first factor of the integrand is nonpositive on $[a, b]$ and since
by l’Hopital’s rule the second factor is continuous, from the mean value
theorem for integrals we obtain that

$$E_1(f) = \frac{f(z) - (L_1 f)(z)}{(z - a)(z - b)} \int_a^b (x - a)(x - b)\,dx$$

for some $z \in [a, b]$. From this, with the aid of the error representation for
linear interpolation from Theorem 8.10 and the integral

$$\int_a^b (x - a)(x - b)\,dx = -\frac{h^3}{6},$$

the assertion of the theorem follows. □



We explicitly note that (9.7) cannot be obtained by integrating the
interpolation error representation (8.8), since we do not know whether the
intermediate point $\xi$ in (8.8) depends continuously on $x$.

By construction, Simpson’s rule integrates polynomials of degree less
than or equal to two exactly. In addition, it also integrates polynomials of
degree three exactly. By linearity, to show this it suffices to prove it for one
polynomial of degree three. For the polynomial

$$q_3(x) = (x - x_0)(x - x_1)(x - x_2)$$

both the integral and the value obtained from Simpson’s rule are zero.
Hence, this polynomial of degree three is integrated exactly by Simpson’s
rule.



Theorem 9.5 Let $f : [a, b] \to \mathbb{R}$ be four-times continuously differentiable.
Then the error for Simpson’s rule can be represented in the form

$$\int_a^b f(x)\,dx - \frac{b - a}{6} \left[ f(a) + 4 f\!\left(\frac{a + b}{2}\right) + f(b) \right] = -\frac{h^5}{90}\, f^{(4)}(\xi) \tag{9.8}$$

for some $\xi \in [a, b]$ and $h = (b - a)/2$.

Proof. Let $L_2 f$ denote the quadratic interpolation polynomial for $f$ at the
interpolation points $x_0 = a$, $x_1 = (a + b)/2$, and $x_2 = b$. By construction
of Simpson’s rule we have that the error

$$E_2(f) := \int_a^b f(x)\,dx - \frac{b - a}{6} \left[ f(a) + 4 f\!\left(\frac{a + b}{2}\right) + f(b) \right]$$

is given by

$$E_2(f) = \int_a^b [f(x) - (L_2 f)(x)]\,dx. \tag{9.9}$$

Consider the cubic polynomial

$$p(x) := (L_2 f)(x) + \frac{4}{(b - a)^2}\, [(L_2 f)'(x_1) - f'(x_1)]\, q_3(x), \tag{9.10}$$

where $q_3(x) = (x - x_0)(x - x_1)(x - x_2)$. Obviously, $p$ has the interpolation
properties

$$p(x_k) = f(x_k), \quad k = 0, 1, 2, \quad \text{and} \quad p'(x_1) = f'(x_1).$$

Since $\int_a^b q_3(x)\,dx = 0$, from (9.9) and (9.10) we can conclude that

$$E_2(f) = \int_a^b [f(x) - p(x)]\,dx,$$

and consequently

$$E_2(f) = \int_a^b (x - x_0)(x - x_1)^2 (x - x_2)\, \frac{f(x) - p(x)}{(x - x_0)(x - x_1)^2 (x - x_2)}\,dx.$$

As in the proof of Theorem 9.4, the first factor of the integrand is
nonpositive on $[a, b]$, and the second factor is continuous. Hence, by the mean
value theorem for integrals, we obtain that

$$E_2(f) = \frac{f(z) - p(z)}{(z - x_0)(z - x_1)^2 (z - x_2)} \int_a^b (x - x_0)(x - x_1)^2 (x - x_2)\,dx$$

for some $z \in [a, b]$. Analogous to Theorem 8.10, it can be shown that

$$\frac{f(z) - p(z)}{(z - x_0)(z - x_1)^2 (z - x_2)} = \frac{f^{(4)}(\xi)}{4!}$$

for some $\xi \in [a, b]$. From this, with the aid of the integral

$$\int_a^b (x - x_0)(x - x_1)^2 (x - x_2)\,dx = -\frac{(b - a)^5}{120},$$

we conclude the statement of the theorem.



□

Example 9.6 The approximation of

$$\ln 2 = \int_0^1 \frac{dx}{1 + x}$$

by the trapezoidal rule yields

$$\ln 2 \approx \frac{1}{2} \left[ 1 + \frac{1}{2} \right] = 0.75.$$

For $f(x) := 1/(1 + x)$ we have

$$\frac{h^3}{12}\, \|f''\|_\infty = \frac{1}{6},$$

and hence, from Theorem 9.4, we obtain the estimate $|\ln 2 - 0.75| \le 0.167$
as compared to the true error $\ln 2 - 0.75 = -0.056\ldots$.

Simpson’s rule yields

$$\ln 2 \approx \frac{1}{6} \left[ 1 + \frac{4}{1 + \frac{1}{2}} + \frac{1}{2} \right] = \frac{25}{36} = 0.6944\ldots,$$

and from Theorem 9.5 and

$$\frac{h^5}{90}\, \|f^{(4)}\|_\infty = \frac{1}{120}$$

we find the estimate $|\ln 2 - 0.6944| \le 0.0084$ as compared to the true error
$\ln 2 - 25/36 = -0.0012\ldots$. □
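
The numbers of Example 9.6 can be reproduced with a few lines of Python. This is a plain sketch of the two rules and of the bounds from Theorems 9.4 and 9.5; the variable names are illustrative, and the maxima of $|f''|$ and $|f^{(4)}|$ on $[0, 1]$ are inserted by hand.

```python
from math import log


def f(x):
    return 1.0 / (1.0 + x)


a, b = 0.0, 1.0

# trapezoidal rule with h = b - a
trapezoid = (b - a) / 2 * (f(a) + f(b))
# bound h^3/12 * max|f''|, where |f''(x)| = 2/(1+x)^3 <= 2 on [0, 1]
trapezoid_bound = (b - a) ** 3 / 12 * 2.0

# Simpson's rule with h = (b - a)/2
h = (b - a) / 2
simpson = h / 3 * (f(a) + 4 * f(a + h) + f(b))
# bound h^5/90 * max|f''''|, where |f''''(x)| = 24/(1+x)^5 <= 24 on [0, 1]
simpson_bound = h ** 5 / 90 * 24.0

print("trapezoid:", trapezoid, "error:", log(2) - trapezoid, "bound:", trapezoid_bound)
print("Simpson:  ", simpson, "error:", log(2) - simpson, "bound:", simpson_bound)
```

In both cases the true error is negative and its modulus stays below the bound, in agreement with the signs in (9.7) and (9.8) for a function with positive second and fourth derivatives.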

In order to increase the accuracy, instead of using higher order Newton- 
Cotes rules it is more practical to use so-called composite rules . These 
are obtained by subdividing the interval of integration and then applying a 
fixed rule with low interpolation order to each of the subintervals. The most 
frequently used quadrature rules of this type are the composite trapezoidal 
rule and the composite Simpson’s rule. 

Let $x_k = a + kh$, $k = 0, \ldots, n$, be an equidistant subdivision with step
size $h = (b - a)/n$. Then the composite trapezoidal rule is given by

$$T_h(f) := h \left[ \frac{1}{2} f(x_0) + f(x_1) + \cdots + f(x_{n-1}) + \frac{1}{2} f(x_n) \right]$$

for $f \in C[a, b]$.

Theorem 9.7 Let $f : [a, b] \to \mathbb{R}$ be twice continuously differentiable. Then
the error for the composite trapezoidal rule is given by

$$\int_a^b f(x)\,dx - T_h(f) = -\frac{b - a}{12}\, h^2 f''(\xi)$$

for some $\xi \in [a, b]$.

Proof. By Theorem 9.4 we have that

$$\int_a^b f(x)\,dx - T_h(f) = -\frac{h^3}{12} \sum_{k=1}^{n} f''(\xi_k),$$

where $a \le \xi_1 \le \xi_2 \le \cdots \le \xi_n \le b$. From

$$n \min_{x \in [a, b]} f''(x) \le \sum_{k=1}^{n} f''(\xi_k) \le n \max_{x \in [a, b]} f''(x)$$

and the continuity of $f''$ we conclude that there exists $\xi \in [a, b]$ such that

$$\sum_{k=1}^{n} f''(\xi_k) = n f''(\xi),$$

and the proof is finished, since $nh = b - a$. □

Let $n$ be even. Then the composite Simpson’s rule is given by

$$S_h(f) := \frac{h}{3}\, [f(x_0) + 4 f(x_1) + 2 f(x_2) + 4 f(x_3) + 2 f(x_4) + \cdots + 2 f(x_{n-2}) + 4 f(x_{n-1}) + f(x_n)]$$

for $f \in C[a, b]$. Its error can be represented and estimated as follows.

Theorem 9.8 Let $f : [a, b] \to \mathbb{R}$ be four-times continuously differentiable.
Then the error for the composite Simpson’s rule is given by

$$\int_a^b f(x)\,dx - S_h(f) = -\frac{b - a}{180}\, h^4 f^{(4)}(\xi)$$

for some $\xi \in [a, b]$.

Proof. Using Theorem 9.5, the proof is analogous to the proof of Theorem
9.7. □

Table 9.2 gives the error between the exact value of the integral from Ex- 
ample 9.6 and its numerical approximation by the composite trapezoidal 
rule and the composite Simpson’s rule. Clearly, if the number n of quadra- 
ture points is doubled, i.e., if the step size h is halved, then the error for 
the trapezoidal rule is reduced by the factor 1/4 and for Simpson’s rule by 
the factor 1/16, as predicted in Theorems 9.7 and 9.8.

TABLE 9.2. Trapezoidal and Simpson’s rule for Example 9.6

 n    Trapezoidal rule    Simpson’s rule
 1    -0.05685282
 2    -0.01518615         -0.00129726
 4    -0.00387663         -0.00010679
 8    -0.00097467         -0.00000735
16    -0.00024402         -0.00000047
32    -0.00006103         -0.00000003
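
A table of this kind can be generated with the following Python sketch of the composite trapezoidal rule and the composite Simpson’s rule. The function names are illustrative, and the printed errors are $\ln 2$ minus the approximation, which is the sign convention of the table.

```python
from math import log


def composite_trapezoid(f, a, b, n):
    """Composite trapezoidal rule T_h(f) with n subintervals."""
    h = (b - a) / n
    inner = sum(f(a + k * h) for k in range(1, n))
    return h * (0.5 * f(a) + inner + 0.5 * f(b))


def composite_simpson(f, a, b, n):
    """Composite Simpson's rule S_h(f) with an even number n of subintervals."""
    if n % 2:
        raise ValueError("n must be even")
    h = (b - a) / n
    odd = sum(f(a + k * h) for k in range(1, n, 2))
    even = sum(f(a + k * h) for k in range(2, n, 2))
    return h / 3 * (f(a) + 4 * odd + 2 * even + f(b))


def integrand(x):
    return 1.0 / (1.0 + x)


for n in (1, 2, 4, 8, 16, 32):
    row = "%2d  %12.8f" % (n, log(2) - composite_trapezoid(integrand, 0.0, 1.0, n))
    if n % 2 == 0:
        row += "  %12.8f" % (log(2) - composite_simpson(integrand, 0.0, 1.0, n))
    print(row)
```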



9.2 Convergence of Quadrature Formulae 



Definition 9.9 A sequence $(Q_n)$ of quadrature formulae is called convergent
if

$$Q_n(f) \to Q(f) = \int_a^b f(x)\,dx, \quad n \to \infty,$$

for all $f \in C[a, b]$.

Theorem 9.10 (Szegő) Let

$$Q_n(f) = \sum_{k=0}^{n} a_k^{(n)} f(x_k^{(n)})$$

be a sequence of quadrature formulae that converges for all polynomials, i.e.,

$$\lim_{n \to \infty} Q_n(p) = Q(p) \tag{9.11}$$

for all polynomials $p$, and that is uniformly bounded, i.e., there exists a
constant $C > 0$ such that

$$\sum_{k=0}^{n} |a_k^{(n)}| \le C \tag{9.12}$$

for all $n \in \mathbb{N}$. Then the sequence $(Q_n)$ is convergent.

Proof. Let $f \in C[a, b]$ and $\varepsilon > 0$ be arbitrary. By the Weierstrass
approximation theorem (see [16]) there exists a polynomial $p$ such that

$$\|f - p\|_\infty \le \frac{\varepsilon}{2(C + b - a)}.$$

Then, since by (9.11) we have $Q_n(p) \to Q(p)$ as $n \to \infty$, there exists
$N(\varepsilon) \in \mathbb{N}$ such that

$$|Q_n(p) - Q(p)| \le \frac{\varepsilon}{2}$$

for all $n \ge N(\varepsilon)$. Now with the aid of the triangle inequality and using
(9.12) we can estimate

$$|Q_n(f) - Q(f)| \le \sum_{k=0}^{n} |a_k^{(n)}|\, |f(x_k^{(n)}) - p(x_k^{(n)})| + |Q_n(p) - Q(p)| + \int_a^b |p(x) - f(x)|\,dx$$

$$\le \frac{C \varepsilon}{2(C + b - a)} + \frac{\varepsilon}{2} + \frac{(b - a)\, \varepsilon}{2(C + b - a)} = \varepsilon$$

for all $n \ge N(\varepsilon)$; i.e., $Q_n(f) \to Q(f)$ for $n \to \infty$. □



A quadrature formula

$$Q_n(f) = \sum_{k=0}^{n} a_k f(x_k)$$

defines a bounded linear operator $Q_n : C[a, b] \to \mathbb{R}$ with the norm given
by

$$\|Q_n\|_\infty = \sum_{k=0}^{n} |a_k|. \tag{9.13}$$

To prove this, we note the estimate

$$|Q_n f| \le \|f\|_\infty \sum_{k=0}^{n} |a_k|,$$

which implies that $Q_n$ is a bounded operator and that the operator norm
is less than or equal to the right-hand side of (9.13). Equality in (9.13)
follows by choosing $f$ to be a continuous piecewise linear function with
$\|f\|_\infty = 1$ and $f(x_k)\, a_k = |a_k|$ for $k = 0, \ldots, n$. From (9.13) and the
uniform boundedness principle, Theorem 12.7, it can be seen that the two
conditions of Theorem 9.10 are also necessary for convergence of a sequence
of quadrature formulae.

Corollary 9.11 (Steklov) Assume that the sequence ( Q n ) of quadrature 
formulae converges for all polynomials and that all the weights are nonneg- 
ative. Then the sequence ( Q n ) is convergent. 

Proof. This follows from

$$\sum_{k=0}^{n} |a_k^{(n)}| = \sum_{k=0}^{n} a_k^{(n)} = Q_n(1) \to \int_a^b dx = b - a, \quad n \to \infty,$$

and the preceding Theorem 9.10. □

From Theorems 9.7 and 9.8 and Corollary 9.11 we observe that the composite
trapezoidal rule and the composite Simpson’s rule are convergent.
On the other hand, using the fact that the conditions of Theorem 9.10 are 
necessary for convergence, it can be shown that the Newton-Cotes quadra- 
tures do not converge for all continuous functions (see Problem 9.5). 
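
The failure of the uniform boundedness condition (9.12) for the Newton-Cotes formulae can be observed numerically. The following Python sketch, with illustrative function names and the interval $[0, 1]$ chosen as an assumption, computes the weights exactly from (9.6) and prints the operator norms (9.13) for a few orders.

```python
from fractions import Fraction
from math import factorial


def newton_cotes_weights(n, a=0, b=1):
    """Exact Newton-Cotes weights a_0..a_n on [a, b] (integers a, b), from (9.6)."""
    h = Fraction(b - a, n)
    weights = []
    for k in range(n + 1):
        # coefficients (lowest degree first) of prod_{j != k} (z - j)
        poly = [Fraction(1)]
        for j in range(n + 1):
            if j != k:
                poly = [s - j * t for s, t in
                        zip([Fraction(0)] + poly, poly + [Fraction(0)])]
        # exact integral of the polynomial over [0, n]
        integral = sum(c * Fraction(n) ** (i + 1) / (i + 1)
                       for i, c in enumerate(poly))
        sign = -1 if (n - k) % 2 else 1
        weights.append(h * sign * integral / (factorial(k) * factorial(n - k)))
    return weights


def operator_norm(n):
    """Norm (9.13) of the Newton-Cotes formula of order n on [0, 1]."""
    return sum(abs(w) for w in newton_cotes_weights(n))


for n in (2, 4, 8, 16, 24):
    print(n, float(operator_norm(n)))
```

As long as all weights are positive the norm equals $b - a = 1$; once negative weights appear it exceeds 1 and keeps growing with the order in this experiment.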



9.3 Gaussian Quadrature Formulae 



Given the arbitrary quadrature points $x_0, \ldots, x_n$ in $[a, b]$, the quadrature
weights $a_0, \ldots, a_n$ of a polynomial interpolatory quadrature are determined
such that all polynomials of degree less than or equal to $n$ are integrated
exactly. In this section we will examine the problem of whether the quadrature
points can be chosen in such a way that polynomials of degree less than
or equal to $2n + 1$ are also integrated exactly. Obviously, to achieve this
degree of exactness the quadrature points and the quadrature weights have
to satisfy the conditions

$$\sum_{k=0}^{n} a_k x_k^i = \int_a^b x^i\,dx, \quad i = 0, \ldots, 2n + 1.$$

We shall see that this system of $2n + 2$ nonlinear equations for the $2n + 2$
unknowns $x_0, \ldots, x_n \in [a, b]$ and $a_0, \ldots, a_n \in \mathbb{R}$ has a unique solution and
that for this solution the points $x_0, \ldots, x_n$ are distinct.

We shall proceed slightly more generally by considering quadrature formulae
for the integral

$$Q(f) := \int_a^b w(x) f(x)\,dx, \tag{9.14}$$

where $w$ denotes some weight function. We assume that $w : (a, b) \to \mathbb{R}$
is continuous and positive and that the integral $\int_a^b w(x)\,dx$ exists. Typical
examples are given by

$$w(x) = 1, \quad w(x) = \sqrt{1 - x^2}, \quad w(x) = \frac{1}{\sqrt{1 - x^2}},$$

where for the two latter cases the interval is assumed to be $[a, b] = [-1, 1]$.
Analogously to the case $w(x) = 1$, interpolatory quadrature rules for (9.14)
are obtained by replacing $f$ by its interpolation polynomial $L_n f$ and
then integrating exactly, i.e., by approximating $Q(f)$ by

$$Q_n(f) := \int_a^b w(x) (L_n f)(x)\,dx.$$

Note that the separation of a weight function $w$ for interpolatory quadrature
formulae has the advantage that in general, $w L_n f$ is a better approximation
to $wf$ than $L_n(wf)$ due to possible singularities of $w$ and its
derivatives at the endpoints of the interval.

Definition 9.12 A quadrature formula

$$\int_a^b w(x) f(x)\,dx \approx \sum_{k=0}^{n} a_k f(x_k)$$

with $n + 1$ distinct quadrature points is called a Gaussian quadrature formula
if it integrates all polynomials $p \in P_{2n+1}$ exactly, i.e., if

$$\sum_{k=0}^{n} a_k p(x_k) = \int_a^b w(x) p(x)\,dx \tag{9.15}$$

for all $p \in P_{2n+1}$.

Lemma 9.13 Let $x_0, \ldots, x_n$ be the $n + 1$ distinct quadrature points of a
Gaussian quadrature formula. Then

$$\int_a^b w(x)\, q_{n+1}(x)\, q(x)\,dx = 0 \tag{9.16}$$

for $q_{n+1}(x) := (x - x_0) \cdots (x - x_n)$ and all $q \in P_n$.

Proof. Since $q_{n+1} q \in P_{2n+1}$ and $q_{n+1}(x_k) = 0$, we have that

$$\int_a^b w(x)\, q_{n+1}(x)\, q(x)\,dx = \sum_{k=0}^{n} a_k\, q_{n+1}(x_k)\, q(x_k) = 0$$

for all $q \in P_n$. □

Lemma 9.14 Let $x_0, \ldots, x_n$ be $n + 1$ distinct points satisfying the condition
(9.16). Then the corresponding polynomial interpolatory quadrature is a
Gaussian quadrature formula.

Proof. Let $L_n$ denote the polynomial interpolation operator for the interpolation
points $x_0, \ldots, x_n$. By construction, for the interpolatory quadrature
we have

$$\sum_{k=0}^{n} a_k f(x_k) = \int_a^b w(x) (L_n f)(x)\,dx \tag{9.17}$$

for all $f \in C[a, b]$. Each $p \in P_{2n+1}$ can be represented in the form

$$p = L_n p + q_{n+1}\, q$$

for some $q \in P_n$, since the polynomial $p - L_n p$ vanishes at the points
$x_0, \ldots, x_n$. Then from (9.16) and (9.17) we obtain that

$$\int_a^b w(x) p(x)\,dx = \int_a^b w(x) (L_n p)(x)\,dx = \sum_{k=0}^{n} a_k p(x_k)$$

for all $p \in P_{2n+1}$.



□

Lemma 9.15 There exists a unique sequence $(q_n)$ of polynomials of the
form $q_0 = 1$ and

$$q_n(x) = x^n + r_{n-1}(x), \quad n = 1, 2, \ldots,$$

with $r_{n-1} \in P_{n-1}$ satisfying the orthogonality relation

$$\int_a^b w(x)\, q_n(x)\, q_m(x)\,dx = 0, \quad n \ne m, \tag{9.18}$$

and

$$P_n = \operatorname{span}\{q_0, \ldots, q_n\}, \quad n = 0, 1, \ldots. \tag{9.19}$$

Proof. This follows by the Gram-Schmidt orthogonalization procedure from
Theorem 3.18 applied to the linearly independent functions $u_n(x) := x^n$
for $n = 0, 1, \ldots$ and the scalar product

$$(f, g) := \int_a^b w(x) f(x) g(x)\,dx$$

for $f, g \in C[a, b]$. The positive definiteness of the scalar product is a
consequence of $w$ being positive in $(a, b)$. □

Lemma 9.16 Each of the orthogonal polynomials $q_n$ from Lemma 9.15
has $n$ simple zeros in $(a, b)$.

Proof. For $m = 0$, from (9.18) we have that

$$\int_a^b w(x)\, q_n(x)\,dx = 0$$

for $n > 0$. Hence, since $w$ is positive on $(a, b)$, the polynomial $q_n$ must
have at least one zero in $(a, b)$ where the sign of $q_n$ changes. Denote by
$x_1, \ldots, x_m$ the zeros of $q_n$ in $(a, b)$ where $q_n$ changes its sign. We assume
that $m < n$ and set $r_m(x) := (x - x_1) \cdots (x - x_m)$. Then $r_m \in P_{n-1}$ and
therefore

$$\int_a^b w(x)\, r_m(x)\, q_n(x)\,dx = 0.$$

However, this integral must be different from zero, since $r_m q_n$ does not
change its sign on $(a, b)$ and does not vanish identically. Hence, we have
arrived at a contradiction, and consequently $m = n$. □

Theorem 9.17 For each $n = 0, 1, \ldots$ there exists a unique Gaussian quadrature
formula of order $n$. Its quadrature points are given by the zeros of
the orthogonal polynomial $q_{n+1}$ of degree $n + 1$.



Proof. This is a consequence of Lemmas 9.13-9.16. 



□

Theorem 9.18 The weights of the Gaussian quadrature formulae are all
positive.

Proof. Define

$$f_k(x) := \left[ \frac{q_{n+1}(x)}{x - x_k} \right]^2, \quad k = 0, \ldots, n.$$

Then

$$a_k\, [q_{n+1}'(x_k)]^2 = \sum_{j=0}^{n} a_j f_k(x_j) = \int_a^b w(x) f_k(x)\,dx > 0,$$

since $f_k \in P_{2n}$, and the theorem is proven. □



Corollary 9.19 The sequence of Gaussian quadrature formulae is conver- 
gent. 



Proof. For each polynomial $p$ we have

$$Q_n(p) = \int_a^b w(x) p(x)\,dx,$$

provided that $2n + 1$ is greater than or equal to the degree of $p$. From their
proofs it is obvious that Theorem 9.10 and its Corollary 9.11 remain valid
for the integral with the weight function $w$. Hence, the statement of the
corollary follows from Theorem 9.18. □



Theorem 9.20 Let $f \in C^{2n+2}[a, b]$. Then the error for the Gaussian
quadrature formula of order $n$ is given by

$$\int_a^b w(x) f(x)\,dx - \sum_{k=0}^{n} a_k f(x_k) = \frac{f^{(2n+2)}(\xi)}{(2n + 2)!} \int_a^b w(x)\, [q_{n+1}(x)]^2\,dx$$

for some $\xi \in [a, b]$.

Proof. Recall the Hermite interpolation polynomial $H_n f \in P_{2n+1}$ for $f$
from Theorem 8.18. Since $(H_n f)(x_k) = f(x_k)$, $k = 0, \ldots, n$, and since the
quadrature integrates $H_n f$ exactly, for the error

$$E_n(f) := \int_a^b w(x) f(x)\,dx - \sum_{k=0}^{n} a_k f(x_k)$$

we can write

$$E_n(f) = \int_a^b w(x)\, [f(x) - (H_n f)(x)]\,dx.$$

Then as in the proofs of Theorems 9.4 and 9.5, using the mean value theorem
for integrals we obtain

$$E_n(f) = \frac{f(z) - (H_n f)(z)}{[q_{n+1}(z)]^2} \int_a^b w(x)\, [q_{n+1}(x)]^2\,dx$$

for some $z \in [a, b]$. Now the proof is finished with the aid of the error
representation for Hermite interpolation from Theorem 8.19. □

Example 9.21 We consider the Gaussian quadrature formulae for the
weight function

$$w(x) = \frac{1}{\sqrt{1 - x^2}}, \quad x \in [-1, 1].$$

The Chebyshev polynomial $T_n$ of degree $n$ is defined by

$$T_n(x) := \cos(n \arccos x), \quad -1 \le x \le 1.$$

Obviously $T_0(x) = 1$ and $T_1(x) = x$. From the addition theorem for the
cosine function, $\cos(n + 1)t + \cos(n - 1)t = 2 \cos t \cos nt$, we can deduce the
recursion formula

$$T_{n+1}(x) + T_{n-1}(x) = 2x\, T_n(x), \quad n = 1, 2, \ldots.$$

Hence we have that $T_n \in P_n$ with leading term

$$T_n(x) = 2^{n-1} x^n + \cdots, \quad n = 1, 2, \ldots.$$

Substituting $x = \cos t$ we find that

$$\int_{-1}^{1} \frac{T_n(x)\, T_m(x)}{\sqrt{1 - x^2}}\,dx = \int_0^\pi \cos nt \cos mt\,dt = \begin{cases} \pi, & n = m = 0, \\ \dfrac{\pi}{2}, & n = m > 0, \\ 0, & n \ne m. \end{cases}$$

Hence, the orthogonal polynomials $q_n$ of Lemma 9.15 are given by
$q_n = 2^{1-n}\, T_n$. The zeros of $T_n$ and hence the quadrature points are given
by

$$x_k = \cos\left( \frac{2k + 1}{2n}\, \pi \right), \quad k = 0, \ldots, n - 1.$$

The weights can be most easily derived from the exactness conditions

$$\sum_{k=0}^{n-1} a_k\, T_m(x_k) = \int_{-1}^{1} \frac{T_m(x)}{\sqrt{1 - x^2}}\,dx, \quad m = 0, \ldots, n - 1,$$

for the interpolatory quadrature, i.e., from

$$\sum_{k=0}^{n-1} a_k \cos \frac{(2k + 1)\, m \pi}{2n} = \begin{cases} \pi, & m = 0, \\ 0, & m = 1, \ldots, n - 1. \end{cases}$$

From our analysis of trigonometric interpolation, i.e., from (8.19), we see
that the unique solution of this linear system is given by

$$a_k = \frac{\pi}{n}, \quad k = 0, \ldots, n - 1.$$

Hence, for $n = 1, 2, \ldots$ the Gauss-Chebyshev quadrature of order $n - 1$ is
given by

$$\int_{-1}^{1} \frac{f(x)}{\sqrt{1 - x^2}}\,dx \approx \frac{\pi}{n} \sum_{k=0}^{n-1} f\left( \cos \frac{2k + 1}{2n}\, \pi \right).$$

From Theorem 9.20 we have the error representation

$$\int_{-1}^{1} \frac{f(x)}{\sqrt{1 - x^2}}\,dx - \frac{\pi}{n} \sum_{k=0}^{n-1} f\left( \cos \frac{2k + 1}{2n}\, \pi \right) = \frac{\pi\, f^{(2n)}(\xi)}{2^{2n-1}\, (2n)!}$$

for some $\xi \in [-1, 1]$. □
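
Because its points and weights are known in closed form, the Gauss-Chebyshev quadrature takes only a few lines of code. The following Python sketch, with an illustrative function name, applies the formula with $n$ quadrature points to a function $f$; note that the weight function $1/\sqrt{1 - x^2}$ is not part of the argument `f`.

```python
from math import cos, pi


def gauss_chebyshev(f, n):
    """Approximate the integral of f(x)/sqrt(1 - x^2) over [-1, 1] with n points."""
    return pi / n * sum(f(cos((2 * k + 1) * pi / (2 * n))) for k in range(n))


# polynomials of degree at most 2n - 1 = 5 are integrated exactly for n = 3
print(gauss_chebyshev(lambda x: x ** 4, 3), 3 * pi / 8)
# for f(x) = x^6 the error representation gives the defect pi/32 for n = 3
print(5 * pi / 16 - gauss_chebyshev(lambda x: x ** 6, 3), pi / 32)
```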

Example 9.22 We now consider the weight function

$$w(x) = 1, \quad x \in [-1, 1].$$

The Legendre polynomial $L_n$ of degree $n$ is defined by

$$L_n(x) := \frac{1}{2^n n!}\, \frac{d^n}{dx^n} (x^2 - 1)^n.$$

Obviously, $L_n \in P_n$. If $m < n$, by repeated partial integration we see that

$$\int_{-1}^{1} x^m\, \frac{d^n}{dx^n} (x^2 - 1)^n\,dx = 0,$$

since $(x^2 - 1)^n$ has zeros of order $n$ at the endpoints $-1$ and $1$. Therefore,

$$\int_{-1}^{1} L_n(x)\, L_m(x)\,dx = 0, \quad n \ne m.$$

The zeros of the Legendre polynomials, and therefore the quadrature
points and weights of the corresponding Gauss-Legendre quadratures, cannot
be given explicitly by a simple expression. We consider only the cases
$n = 1$ and $n = 2$ and note that

$$q_0(x) = 1, \quad q_1(x) = x, \quad q_2(x) = x^2 - \frac{1}{3},$$

where the coefficient of $q_2$ can be determined from $\int_{-1}^{1} q_2(x)\,dx = 0$.

The quadrature point for the first Gauss-Legendre formula is $x_1 = 0$,
and the weight $a_1$ can be obtained from the exactness condition

$$a_1 = \int_{-1}^{1} dx = 2.$$

Hence the first Gauss-Legendre formula is given by

$$\int_{-1}^{1} f(x)\,dx \approx 2 f(0) \tag{9.20}$$

with the error representation

$$\int_{-1}^{1} f(x)\,dx - 2 f(0) = \frac{1}{3}\, f''(\xi)$$

for some $\xi \in [-1, 1]$. The coefficient of the derivative on the right-hand
side follows most easily by inserting $f(x) = x^2$. For obvious reasons, this
Gauss-Legendre formula is also known as the midpoint rule.

The quadrature points for the second Gauss-Legendre formula are
$x_1 = -1/\sqrt{3}$ and $x_2 = 1/\sqrt{3}$. The weights can be obtained from the
exactness conditions

$$a_1 + a_2 = \int_{-1}^{1} dx = 2,$$

$$a_1 x_1 + a_2 x_2 = \int_{-1}^{1} x\,dx = 0,$$

and they have the values $a_1 = a_2 = 1$. Hence the second Gauss-Legendre
formula is given by

$$\int_{-1}^{1} f(x)\,dx \approx f\left( -\frac{1}{\sqrt{3}} \right) + f\left( \frac{1}{\sqrt{3}} \right)$$

with the error representation

$$\int_{-1}^{1} f(x)\,dx - f\left( -\frac{1}{\sqrt{3}} \right) - f\left( \frac{1}{\sqrt{3}} \right) = \frac{1}{135}\, f^{(4)}(\xi)$$

for some $\xi \in [-1, 1]$. The coefficient on the right-hand side follows by
inserting $f(x) = x^4$. □

From the Gaussian quadrature formula



$$\int_{-1}^{1} g(z)\,dz \approx \sum_{k=0}^{n} a_k\,g(x_k)$$

of order $n$ for the interval $[-1, 1]$, by substituting

$$x = \frac{a+b}{2} + \frac{b-a}{2}\,z$$

and $f(x) = g(z)$ we obtain the Gaussian quadrature formula

$$\int_a^b f(x)\,dx \approx \frac{b-a}{2} \sum_{k=0}^{n} a_k\,f\left(\frac{a+b}{2} + \frac{b-a}{2}\,x_k\right)$$

for an arbitrary interval $[a, b]$. The error representation

$$\int_{-1}^{1} g(z)\,dz - \sum_{k=0}^{n} a_k\,g(x_k) = \frac{g^{(2n+2)}(\xi)}{(2n+2)!} \int_{-1}^{1} [q_{n+1}(x)]^2\,dx$$

with $\xi \in [-1, 1]$ can be transformed accordingly. Subdividing the interval
$[a, b]$ into $m$ equidistant subintervals with step width $h = (b-a)/m$ and
then applying to each subinterval the Gaussian quadrature formula of order
$n$, we obtain the composite Gaussian quadrature

$$\int_a^b f(x)\,dx \approx \frac{h}{2} \sum_{j=0}^{m-1} \sum_{k=0}^{n} a_k\,f\left(a + jh + \frac{h}{2} + \frac{h}{2}\,x_k\right)$$

with an error of order $O(h^{2n})$. These composite Gaussian rules are used
quite frequently in practice. We illustrate their convergence behavior by
Table 9.3, which gives the error between the exact value of the integral
from Example 9.6 and its numerical approximation by composite Gaussian
quadrature of orders one and two. As predicted by our error analysis, if the
number $m$ of subintervals is doubled, i.e., if the step size $h$ is halved,
then the error for the Gaussian quadrature of orders one and two is reduced
roughly by the factor 1/4 and 1/16, respectively.



TABLE 9.3. Gaussian quadrature for Example 9.6

   m    n = 1         n = 2
   1    0.02648051    0.00083949
   2    0.00743289    0.00007054
   4    0.00192729    0.00000489
   8    0.00048663    0.00000031
  16    0.00012197    0.00000002
  32    0.00003051    0.00000000
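
The composite Gaussian rule can be sketched in Python as follows. This is an illustration under assumptions: the integrand of Example 9.6 is not restated in this section, and the code takes it to be $\int_0^1 dx/(1+x) = \ln 2$, which is consistent with the first entries of Table 9.3. The names `RULES` and `composite_gauss` are hypothetical, and the parameter `n` counts the quadrature points per subinterval, as in the table columns.

```python
import math

# Nodes and weights on [-1, 1] for the one- and two-point Gauss-Legendre rules.
RULES = {
    1: ([0.0], [2.0]),
    2: ([-1.0 / math.sqrt(3.0), 1.0 / math.sqrt(3.0)], [1.0, 1.0]),
}

def composite_gauss(f, a, b, m, n):
    """Composite Gauss-Legendre rule with m subintervals and n points each."""
    nodes, weights = RULES[n]
    h = (b - a) / m
    total = 0.0
    for j in range(m):
        mid = a + j * h + h / 2  # midpoint of the j-th subinterval
        total += sum(w * f(mid + h / 2 * x) for x, w in zip(nodes, weights))
    return h / 2 * total

# Assumed integrand for Example 9.6 (inferred from the table, not from the text).
f = lambda x: 1.0 / (1.0 + x)
exact = math.log(2.0)
for m in (1, 2, 4, 8):
    e1 = exact - composite_gauss(f, 0.0, 1.0, m, 1)
    e2 = exact - composite_gauss(f, 0.0, 1.0, m, 2)
    print(f"{m:3d}  {e1:.8f}  {e2:.8f}")
```

Under this assumption the printed errors shrink by factors of roughly 4 and 16 per halving of $h$, in line with the discussion of the table.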



9.4 Quadrature of Periodic Functions 



We proceed by deriving the Euler-Maclaurin expansion. 

Definition 9.23 The Bernoulli polynomials $B_n$ of degree $n$ are defined
recursively by $B_0(x) := 1$ and

$$B_n' := B_{n-1}, \quad n \in \mathbb{N}, \tag{9.21}$$

with the normalization condition

$$\int_0^1 B_n(x)\,dx = 0, \quad n \in \mathbb{N}. \tag{9.22}$$

The rational numbers

$$b_n := n!\,B_n(0), \quad n = 0, 1, \ldots,$$

are called Bernoulli numbers.
The first Bernoulli polynomials are given by

$$B_0(x) = 1, \quad B_1(x) = x - \frac{1}{2}, \quad B_2(x) = \frac{1}{2}\,x^2 - \frac{1}{2}\,x + \frac{1}{12}.$$

We note that the normalization (9.22) is equivalent to

$$B_n(0) = B_n(1), \quad n = 2, 3, \ldots. \tag{9.23}$$
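
The recursion (9.21) with the normalization (9.22) translates directly into exact rational arithmetic. The sketch below is illustrative only: the helper name `bernoulli_polynomials` and the coefficient-list representation (lowest degree first) are choices made here, not part of the text.

```python
from fractions import Fraction
from math import factorial

def bernoulli_polynomials(n_max):
    """Return coefficient lists (lowest degree first) of B_0, ..., B_{n_max}."""
    polys = [[Fraction(1)]]
    for _ in range(n_max):
        prev = polys[-1]
        # Antiderivative of B_{n-1} with constant term still zero, see (9.21).
        integ = [Fraction(0)] + [c / (i + 1) for i, c in enumerate(prev)]
        # Fix the constant so that the integral over [0, 1] vanishes, see (9.22).
        integ[0] = -sum(c / (i + 1) for i, c in enumerate(integ))
        polys.append(integ)
    return polys

polys = bernoulli_polynomials(4)
# Bernoulli numbers b_n = n! * B_n(0); B_n(0) is the constant coefficient.
bernoulli_numbers = [factorial(n) * p[0] for n, p in enumerate(polys)]
print(polys[2])
print(bernoulli_numbers)
```

The list `polys[2]` reproduces the coefficients 1/12, -1/2, 1/2 of $B_2$ given above, and the odd Bernoulli number $b_3$ comes out as zero.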

Lemma 9.24 The Bernoulli polynomials have the symmetry property

$$B_n(x) = (-1)^n B_n(1-x), \quad x \in \mathbb{R}, \quad n = 0, 1, \ldots. \tag{9.24}$$

Proof. Obviously (9.24) holds for $n = 0$. Assume that (9.24) has been
proven for some $n \ge 0$. Then, integrating (9.24), we obtain

$$B_{n+1}(x) = (-1)^{n+1} B_{n+1}(1-x) + \beta_{n+1}$$

for some constant $\beta_{n+1}$. The condition (9.22) implies that $\beta_{n+1} = 0$, and
therefore (9.24) is also valid for $n + 1$. □

Lemma 9.25 The Bernoulli polynomials $B_{2m+1}$, $m = 1, 2, \ldots$, of odd
degree have exactly three zeros in $[0, 1]$, and these zeros are at the points
0, 1/2, and 1. The Bernoulli polynomials $B_{2m}$, $m = 0, 1, \ldots$, of even degree
satisfy $B_{2m}(0) \neq 0$.

Proof. From (9.23) and (9.24) we conclude that $B_{2m+1}$ vanishes at the
points 0, 1/2, and 1. We prove by induction that these are the only zeros
of $B_{2m+1}$ in $[0, 1]$. This is true for $m = 1$, since $B_3$ is a polynomial of degree
three. Assume that we have proven that $B_{2m+1}$ has only the three zeros
0, 1/2, and 1 in $[0, 1]$, and assume that $B_{2m+3}$ has an additional zero $\alpha$ in
$[0, 1]$. Because of the symmetry (9.24) we may assume that $\alpha \in (0, 1/2)$.
Then, by Rolle's theorem, we conclude that $B_{2m+2}$ has at least one zero in
$(0, \alpha)$ and also at least one zero in $(\alpha, 1/2)$. Again by Rolle's theorem this
implies that $B_{2m+1}$ has a zero in $(0, 1/2)$, which contradicts the induction
assumption.

From the zeros of $B_{2m+1}$, by Rolle's theorem it follows that $B_{2m}$ has a
zero in $(0, 1/2)$. Assume that $B_{2m}(0) = 0$. Then, by Rolle's theorem, $B_{2m-1}$
has a zero in $(0, 1/2)$, which contradicts the first part of the lemma. □

By $\tilde{B}_n : \mathbb{R} \to \mathbb{R}$ we denote the periodic extension of the Bernoulli
polynomial $B_n$; i.e., $\tilde{B}_n$ has period 1 and $\tilde{B}_n(x) = B_n(x)$ for $0 \le x < 1$.
The Fourier series of the periodic functions $\tilde{B}_n$ are given by

$$\tilde{B}_{2m}(x) = 2(-1)^{m-1} \sum_{k=1}^{\infty} \frac{\cos 2\pi k x}{(2\pi k)^{2m}} \tag{9.25}$$

and

$$\tilde{B}_{2m+1}(x) = 2(-1)^{m-1} \sum_{k=1}^{\infty} \frac{\sin 2\pi k x}{(2\pi k)^{2m+1}} \tag{9.26}$$

for $m = 1, 2, \ldots$. This follows from (9.21) and (9.22) and the elementary
Fourier expansion for the piecewise linear function $\tilde{B}_1$ (see Problem 9.13).

Let $x_k = a + kh$, $k = 0, \ldots, n$, be an equidistant subdivision of the
interval $[a, b]$ with step size $h = (b-a)/n$ and recall the definition of the
trapezoidal sum

$$T_h(f) := h \left[ \frac{1}{2}\,f(x_0) + f(x_1) + \cdots + f(x_{n-1}) + \frac{1}{2}\,f(x_n) \right]$$

for $f \in C[a, b]$.



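Theorem 9.26 Let $f : [a, b] \to \mathbb{R}$ be $m$ times continuously differentiable
for $m \ge 2$. Then we have the Euler-Maclaurin expansion

$$\int_a^b f(x)\,dx = T_h(f) - \sum_{j=1}^{[m/2]} \frac{b_{2j}\,h^{2j}}{(2j)!} \left[ f^{(2j-1)}(b) - f^{(2j-1)}(a) \right] + (-1)^m h^m \int_a^b \tilde{B}_m\left(\frac{x-a}{h}\right) f^{(m)}(x)\,dx, \tag{9.27}$$

where $[m/2]$ denotes the largest integer smaller than or equal to $m/2$.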

Proof. Let $g \in C^m[0, 1]$. Then, by $m - 1$ partial integrations and using
(9.23) we find that

$$\int_0^1 B_1(z)\,g'(z)\,dz = \sum_{j=2}^{m} (-1)^j B_j(0) \left[ g^{(j-1)}(1) - g^{(j-1)}(0) \right] - (-1)^m \int_0^1 B_m(z)\,g^{(m)}(z)\,dz.$$

Combining this with the partial integration

$$\int_0^1 B_1(z)\,g'(z)\,dz = \frac{1}{2}\,[g(1) + g(0)] - \int_0^1 g(z)\,dz$$

and observing that the odd Bernoulli numbers vanish leads to

$$\int_0^1 g(z)\,dz = \frac{1}{2}\,[g(0) + g(1)] - \sum_{j=1}^{[m/2]} \frac{b_{2j}}{(2j)!} \left[ g^{(2j-1)}(1) - g^{(2j-1)}(0) \right] + (-1)^m \int_0^1 B_m(z)\,g^{(m)}(z)\,dz.$$

Now we substitute $x = x_k + hz$ and $g(z) = f(x_k + hz)$ to obtain

$$\int_{x_k}^{x_{k+1}} f(x)\,dx = \frac{h}{2}\,[f(x_k) + f(x_{k+1})] - \sum_{j=1}^{[m/2]} \frac{b_{2j}\,h^{2j}}{(2j)!} \left[ f^{(2j-1)}(x_{k+1}) - f^{(2j-1)}(x_k) \right] + (-1)^m h^m \int_{x_k}^{x_{k+1}} B_m\left(\frac{x - x_k}{h}\right) f^{(m)}(x)\,dx.$$

Finally, we sum the last equation for $k = 0, \ldots, n-1$ to arrive at the
Euler-Maclaurin expansion (9.27). □
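
The leading term of the Euler-Maclaurin expansion can be checked numerically. The sketch below is an illustration, not part of the original text: it subtracts the term with $b_2 = 1/6$, i.e., $\frac{h^2}{12}[f'(b) - f'(a)]$, from the trapezoidal sum and observes that the remaining error behaves like $h^4$. The test function $f(x) = e^x$ on $[0, 1]$ and the helper names are assumptions made for the example.

```python
import math

def trapezoid(f, a, b, n):
    """Trapezoidal sum T_h(f) with n subintervals of width h = (b - a) / n."""
    h = (b - a) / n
    return h * (0.5 * f(a) + sum(f(a + k * h) for k in range(1, n)) + 0.5 * f(b))

def corrected_trapezoid(f, df, a, b, n):
    """Trapezoidal sum minus the first Euler-Maclaurin term (b_2 = 1/6)."""
    h = (b - a) / n
    return trapezoid(f, a, b, n) - h**2 / 12.0 * (df(b) - df(a))

# f = exp is its own derivative; the exact integral over [0, 1] is e - 1.
exact = math.e - 1.0
err = {n: exact - corrected_trapezoid(math.exp, math.exp, 0.0, 1.0, n)
       for n in (8, 16)}
ratio = err[8] / err[16]
print(err, ratio)
```

Halving the step size reduces the corrected error by a factor close to 16, as expected when the $h^2$ term of (9.27) has been removed.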



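For $2\pi$-periodic continuous functions $f : \mathbb{R} \to \mathbb{R}$ the trapezoidal rule
coincides with the rectangular rule

$$\int_0^{2\pi} f(x)\,dx \approx \frac{2\pi}{n} \sum_{k=0}^{n-1} f\left(\frac{2\pi k}{n}\right).$$

For its error

$$E_n(f) := \int_0^{2\pi} f(x)\,dx - \frac{2\pi}{n} \sum_{k=0}^{n-1} f\left(\frac{2\pi k}{n}\right)$$

we have the following corollary of the Euler-Maclaurin expansion.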



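Corollary 9.27 Let $f : \mathbb{R} \to \mathbb{R}$ be $(2m+1)$-times continuously
differentiable and $2\pi$-periodic for $m \in \mathbb{N}$ and let $n \in \mathbb{N}$. Then for the error of the
rectangular rule we have

$$|E_n(f)| \le \frac{C}{n^{2m+1}} \int_0^{2\pi} |f^{(2m+1)}(x)|\,dx,$$

where

$$C := 2 \sum_{k=1}^{\infty} \frac{1}{k^{2m+1}}.$$

Proof. From Theorem 9.26 we have that

$$E_n(f) = -\left(\frac{2\pi}{n}\right)^{2m+1} \int_0^{2\pi} \tilde{B}_{2m+1}\left(\frac{nx}{2\pi}\right) f^{(2m+1)}(x)\,dx,$$

and the estimate follows from the inequality

$$|\tilde{B}_{2m+1}(x)| \le 2 \sum_{k=1}^{\infty} \frac{1}{(2\pi k)^{2m+1}}, \quad x \in \mathbb{R},$$

which is a consequence of (9.26). □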
Corollary 9.27 illustrates why for periodic functions the simple rectangular
rule is superior to any other quadrature rule (see Problem 9.12). Note
that the rectangular rule can also be obtained by integrating the trigono- 
metric interpolation polynomials of Theorems 8.24 and 8.25. 

In the following theorem we give an example of derivative-free error esti- 
mates for numerical quadrature rules in the spirit of Davis [15]. They have 
the advantage that they do not need the computation of higher derivatives 
for the evaluation of the estimates. However, they require the integrand to 
be analytic, and their proofs need complex analysis. 



Theorem 9.28 Let $f : \mathbb{R} \to \mathbb{R}$ be analytic and $2\pi$-periodic. Then there
exists a strip $D = \mathbb{R} \times (-a, a) \subset \mathbb{C}$ with $a > 0$ such that $f$ can be extended
to a holomorphic and $2\pi$-periodic bounded function $f : D \to \mathbb{C}$. The error
for the rectangular rule can be estimated by

$$|E_n(f)| \le \frac{4\pi M}{e^{na} - 1},$$

where $M$ denotes a bound for the holomorphic function $f$ on $D$.



Proof. Since $f : \mathbb{R} \to \mathbb{R}$ is analytic, at each point $x \in \mathbb{R}$ the Taylor
expansion provides a holomorphic extension of $f$ into some open disk in the
complex plane with radius $r(x) > 0$ and center $x$. The extended function
again has period $2\pi$, since the coefficients of the Taylor series at $x$ and
at $x + 2\pi$ coincide for the $2\pi$-periodic function $f : \mathbb{R} \to \mathbb{R}$. The disks
corresponding to all points of the interval $[0, 2\pi]$ provide an open covering
of $[0, 2\pi]$. Since $[0, 2\pi]$ is compact, a finite number of these disks suffices to
cover $[0, 2\pi]$. Then we have an extension into a strip $D$ with finite width
$2a$ contained in the union of the finite number of disks. Without loss of
generality we may assume that $f$ is bounded on $D$.

From the residue theorem we have that

$$\int_{i\alpha}^{i\alpha + 2\pi} \cot\frac{nz}{2}\,f(z)\,dz - \int_{-i\alpha}^{-i\alpha + 2\pi} \cot\frac{nz}{2}\,f(z)\,dz = -\frac{4\pi i}{n} \sum_{k=0}^{n-1} f\left(\frac{2\pi k}{n}\right)$$

for each $0 < \alpha < a$. This implies that

$$\mathrm{Re} \int_{i\alpha}^{i\alpha + 2\pi} i \cot\frac{nz}{2}\,f(z)\,dz = \frac{2\pi}{n} \sum_{k=0}^{n-1} f\left(\frac{2\pi k}{n}\right),$$

since by the Schwarz reflection principle, $f$ enjoys the symmetry property
$f(\bar{z}) = \overline{f(z)}$. By Cauchy's integral theorem we have

$$\mathrm{Re} \int_{i\alpha}^{i\alpha + 2\pi} f(z)\,dz = \int_0^{2\pi} f(x)\,dx,$$

and combining the last two equations yields

$$E_n(f) = \mathrm{Re} \int_{i\alpha}^{i\alpha + 2\pi} \left( 1 - i \cot\frac{nz}{2} \right) f(z)\,dz$$

for all $0 < \alpha < a$. Now the estimate follows from

$$\left| 1 - i \cot\frac{nz}{2} \right| \le \frac{2}{e^{n\alpha} - 1}$$

for $\mathrm{Im}\,z = \alpha$ and then passing to the limit $\alpha \to a$. □



The estimate shows that for periodic analytic functions the rectangu- 
lar rule is of exponential order; i.e., doubling the number of quadrature 
points doubles the number of correct digits in the approximate value for 
the integral. 
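
The exponential convergence can be observed in a small experiment. The sketch below is illustrative: the integrand $1/(5 - 4\cos x)$, whose integral over $[0, 2\pi]$ equals $2\pi/3$, is an example chosen here for being analytic and $2\pi$-periodic, and `rectangular_rule` is a hypothetical helper name.

```python
import math

def rectangular_rule(f, n):
    """Rectangular rule with n equidistant points on [0, 2*pi]."""
    return 2.0 * math.pi / n * sum(f(2.0 * math.pi * k / n) for k in range(n))

# Analytic, 2*pi-periodic integrand with known integral 2*pi/3 (chosen for illustration).
f = lambda x: 1.0 / (5.0 - 4.0 * math.cos(x))
exact = 2.0 * math.pi / 3.0
errors = {n: abs(exact - rectangular_rule(f, n)) for n in (4, 8, 16, 32)}
for n, e in errors.items():
    print(f"{n:3d}  {e:.3e}")
```

Each doubling of $n$ roughly squares the error, i.e., doubles the number of correct digits, until rounding errors dominate.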



9.5 Romberg Integration 

We now proceed with describing the extrapolation method due to Richard- 
son (1927). Its basic idea is to derive high-order approximation methods 
from simple low-order methods. It can be applied to a variety of formulae in 
numerical analysis, and its application to the Euler-Maclaurin expansion 
was suggested by Romberg in 1955. 

Recall the composite trapezoidal rule

$$T_h^1(f) := h \left[ \frac{1}{2}\,f(a) + \sum_{k=1}^{n-1} f(a + kh) + \frac{1}{2}\,f(b) \right]$$

with step size $h = (b-a)/n$. If $f$ is four-times continuously differentiable,
by the Euler-Maclaurin expansion from Theorem 9.26 we have an error
representation of the form

$$\int_a^b f(x)\,dx = T_h^1(f) + \gamma_1 h^2 + O(h^4)$$

for some constant $\gamma_1$ depending on $f$ but not on $h$. Hence, for half the step
size, we have that

$$\int_a^b f(x)\,dx = T_{h/2}^1(f) + \gamma_1\,\frac{h^2}{4} + O(h^4).$$

From these two equations we can eliminate the terms containing $h^2$; i.e.,
we multiply the first equation by $-1/3$ and the second equation by $4/3$ and
add both equations to obtain

$$\int_a^b f(x)\,dx = \frac{1}{3} \left[ 4\,T_{h/2}^1(f) - T_h^1(f) \right] + O(h^4).$$

Hence, the linear combination

$$T_h^2(f) := \frac{1}{3} \left[ 4\,T_{h/2}^1(f) - T_h^1(f) \right]$$

of the composite trapezoidal rule with step sizes $h$ and $h/2$ leads to a
quadrature formula with the improved error order $O(h^4)$. The quadrature
$T_h^2(f)$ coincides with the composite Simpson's rule for the step size $h/2$.

If $f$ is six-times continuously differentiable, by linearly combining the
Euler-Maclaurin formulae for the step sizes $h$ and $h/2$ we obtain an error
representation of the form

$$\int_a^b f(x)\,dx = T_h^2(f) + \gamma_2 h^4 + O(h^6)$$

for some constant $\gamma_2$ depending only on $f$. From this and the corresponding
formula

$$\int_a^b f(x)\,dx = T_{h/2}^2(f) + \gamma_2\,\frac{h^4}{16} + O(h^6)$$

for step size $h/2$, by eliminating the terms containing $h^4$ we obtain the
quadrature formula

$$T_h^3(f) := \frac{1}{15} \left[ 16\,T_{h/2}^2(f) - T_h^2(f) \right]$$

with an error of order $O(h^6)$. Note that the actual numerical evaluation of
$T_h^3(f)$ requires the values for the composite trapezoidal rule for the step
sizes $h$, $h/2$, and $h/4$.

Obviously, this procedure can be repeated, and this leads to the sequence
of Romberg quadrature formulae. Let

$$T_k^1(f) := T_{h_k}^1(f), \quad k = 0, 1, 2, \ldots,$$

be the trapezoidal sums for the step sizes $h_k := h/2^k$. Then for $m = 1, 2, \ldots$
the Romberg quadratures are recursively defined by

$$T_k^{m+1}(f) := \frac{1}{4^m - 1} \left[ 4^m\,T_{k+1}^m(f) - T_k^m(f) \right], \quad k = 0, 1, \ldots. \tag{9.28}$$

For the error we have the following theorem.

Theorem 9.29 Let $f : [a, b] \to \mathbb{R}$ be $2m$-times continuously differentiable.
Then for the Romberg quadratures we have the error estimate

$$\left| \int_a^b f(x)\,dx - T_k^m(f) \right| \le C_m\,\|f^{(2m)}\|_\infty \left( \frac{h}{2^k} \right)^{2m}, \quad k = 0, 1, \ldots,$$

for some constant $C_m$ depending on $m$.



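Proof. By induction, we show that there exist constants $\gamma_{j,i}$ such that

$$\left| \int_a^b f(x)\,dx - T_k^i(f) - \sum_{j=i}^{m-1} \gamma_{j,i} \left[ f^{(2j-1)}(b) - f^{(2j-1)}(a) \right] \left( \frac{h}{2^k} \right)^{2j} \right| \le \gamma_{m,i}\,\|f^{(2m)}\|_\infty \left( \frac{h}{2^k} \right)^{2m} \tag{9.29}$$

for $i = 1, \ldots, m$ and $k = 0, 1, \ldots$. Here the sum on the left-hand side is set
equal to zero for $i = m$. By the Euler-Maclaurin expansion this is true for
$i = 1$ with $\gamma_{j,1} = -b_{2j}/(2j)!$ for $j = 1, \ldots, m-1$ and

$$\gamma_{m,1} = (b-a) \sup_{x \in [0,1]} |B_{2m}(x)|.$$

As an abbreviation we set

$$F_j := f^{(2j-1)}(b) - f^{(2j-1)}(a), \quad j = 1, \ldots, m-1.$$

Assume that (9.29) has been shown for some $1 \le i < m$. Then, using (9.28),
we obtain

$$\frac{4^i}{4^i - 1} \left[ \int_a^b f(x)\,dx - T_{k+1}^i(f) - \sum_{j=i}^{m-1} \gamma_{j,i} F_j \left( \frac{h}{2^{k+1}} \right)^{2j} \right] - \frac{1}{4^i - 1} \left[ \int_a^b f(x)\,dx - T_k^i(f) - \sum_{j=i}^{m-1} \gamma_{j,i} F_j \left( \frac{h}{2^k} \right)^{2j} \right]$$

$$= \int_a^b f(x)\,dx - T_k^{i+1}(f) - \sum_{j=i+1}^{m-1} \gamma_{j,i+1} F_j \left( \frac{h}{2^k} \right)^{2j},$$

where

$$\gamma_{j,i+1} = \frac{4^{i-j} - 1}{4^i - 1}\,\gamma_{j,i}, \quad j = i+1, \ldots, m-1.$$

Now with the aid of the induction assumption we can estimate

$$\left| \int_a^b f(x)\,dx - T_k^{i+1}(f) - \sum_{j=i+1}^{m-1} \gamma_{j,i+1} F_j \left( \frac{h}{2^k} \right)^{2j} \right| \le \gamma_{m,i+1}\,\|f^{(2m)}\|_\infty \left( \frac{h}{2^k} \right)^{2m},$$

where

$$\gamma_{m,i+1} = \frac{4^{i-m} + 1}{4^i - 1}\,\gamma_{m,i},$$

and the proof is complete. □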



From Theorem 9.29 we conclude that the Romberg quadrature $T_k^m$ integrates
polynomials of degree less than or equal to $2m - 1$ exactly. For
$h = b - a$ the Romberg quadrature $T_0^m$ uses $2^{m-1} + 1$ equidistant integration
points. Therefore, $T_0^1$ coincides with the trapezoidal rule, $T_0^2$ with Simpson's
rule, and $T_0^3$ with Milne's rule. Similarly, $T_k^1$, $T_k^2$, and $T_k^3$ correspond
to the composite trapezoidal rule, the composite Simpson's rule, and the
composite Milne rule, respectively. For $m \ge 4$ the number of the quadrature
points in $T_0^m$ is greater than the degree of exactness. The Romberg formula
$T_0^4$ uses nine quadrature points, and this is the number of quadrature points
where the Newton-Cotes formulae start having negative weights.

Theorem 9.30 The quadrature weights of the Romberg formulae are positive.

Proof. We define recursively $Q_k^1 := 4\,T_{k+1}^1 - 2\,T_k^1$ and

$$Q_k^{m+1} := \frac{1}{4^m - 1} \left[ 2^{2m+1}\,T_{k+1}^m + 2\,T_k^m + 4^{m+1}\,Q_{k+1}^m \right] \tag{9.30}$$

for $k = 0, 1, \ldots$ and $m = 1, 2, \ldots$ and show by induction that

$$T_{k+1}^m = \frac{1}{4^m} \left[ 2\,T_k^m + Q_k^m \right]. \tag{9.31}$$

By the definition of $Q_k^1$ this is true for $m = 1$. We assume that (9.31) has
been proven for some $m \ge 1$. Then, using the recursive definitions of $T_k^m$
and $Q_k^m$ and the induction assumption, we derive

$$2\,T_k^{m+1} + Q_k^{m+1} = \frac{4^{m+1}}{4^m - 1} \left[ T_{k+1}^m + Q_{k+1}^m \right] = \frac{4^{m+1}}{4^m - 1} \left[ 4^m\,T_{k+2}^m - T_{k+1}^m \right] = 4^{m+1}\,T_{k+1}^{m+1};$$

i.e., (9.31) also holds for $m + 1$. Now, from (9.30) and (9.31), by induction
with respect to $m$, it can be deduced that the weights of $T_k^m$ are positive
and that the weights of $Q_k^m$ are nonnegative. □

Corollary 9.31 For the Romberg quadratures we have convergence:

$$\lim_{m \to \infty} T_k^m(f) = \int_a^b f(x)\,dx \quad \text{and} \quad \lim_{k \to \infty} T_k^m(f) = \int_a^b f(x)\,dx$$

for all continuous functions $f$.

Proof. This follows from Theorems 9.29 and 9.30 and Corollary 9.11. □



For continuous functions, the trapezoidal sums converge as the step size
tends to zero. This motivates us to consider a polynomial in $h^2$ interpolating
the values $T_k^1(f), \ldots, T_{k+m}^1(f)$ at the interpolation points $h_k^2, \ldots, h_{k+m}^2$ and
evaluate it at $h = 0$.

Theorem 9.32 Denote by $L_k^m$ the uniquely determined polynomial in $h^2$
of degree less than or equal to $m$ with the interpolation property

$$L_k^m(h_j^2) = T_j^1(f), \quad j = k, \ldots, k+m.$$

Then the Romberg quadratures satisfy

$$T_k^{m+1}(f) = L_k^m(0). \tag{9.32}$$

Proof. Obviously, (9.32) is true for $m = 0$. Assume that it has been proven
for $m - 1$. Then, using the Neville scheme from Theorem 8.9, we obtain

$$L_k^m(0) = \frac{1}{h_{k+m}^2 - h_k^2} \left[ -h_k^2\,L_{k+1}^{m-1}(0) + h_{k+m}^2\,L_k^{m-1}(0) \right]$$

$$= \frac{1}{h_{k+m}^2 - h_k^2} \left[ -h_k^2\,T_{k+1}^m + h_{k+m}^2\,T_k^m \right]$$

$$= \frac{1}{4^m - 1} \left[ 4^m\,T_{k+1}^m - T_k^m \right] = T_k^{m+1},$$

establishing (9.32) for $m$. □

This interpretation of the Romberg quadrature as an extrapolation method 
in the sense of Richardson opens up the possibility of modifications using 
other than equidistant step sizes. 

Table 9.4 gives the error between the exact value of the integral from 
Example 9.6 and its numerical approximation by the Romberg quadrature, 
exhibiting its fast convergence according to the error estimates of Theorem 
9.29. Clearly, the first two columns of Table 9.4 have to coincide with Table 
9.2. 



TABLE 9.4. Romberg quadratures for Example 9.6

   k    T_k^1          T_k^2          T_k^3          T_k^4
   1    -0.05685282
   2    -0.01518615    -0.00129726
   4    -0.00387663    -0.00010679    -0.00002742
   8    -0.00097467    -0.00000735    -0.00000072    -0.00000030
  16    -0.00024402    -0.00000047    -0.00000001    -0.00000000
  32    -0.00006103    -0.00000003    -0.00000000    -0.00000000
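
The recursion (9.28) is usually organized as a triangular tableau. The following Python sketch is a hedged illustration: the function name `romberg` and the row/column layout are choices made here, and the integrand $1/(1+x)$ on $[0, 1]$ is an assumption for Example 9.6 that is consistent with the first entries of Table 9.4 but not restated in this section.

```python
import math

def romberg(f, a, b, levels):
    """Return the triangular tableau R with R[i][j] = T^{j+1}_{i-j} from (9.28)."""
    R = []
    n, h = 1, b - a
    trap = h * 0.5 * (f(a) + f(b))
    for i in range(levels):
        if i > 0:
            # Halve the step: reuse the old sum and add only the new midpoints.
            h /= 2.0
            trap = 0.5 * trap + h * sum(f(a + (2 * k + 1) * h) for k in range(n))
            n *= 2
        row = [trap]
        for j in range(1, i + 1):
            # Extrapolation step (9.28) with m = j.
            row.append((4**j * row[j - 1] - R[i - 1][j - 1]) / (4**j - 1))
        R.append(row)
    return R

# Assumed integrand for Example 9.6 (inferred from the table values).
exact = math.log(2.0)
R = romberg(lambda x: 1.0 / (1.0 + x), 0.0, 1.0, 4)
for row in R:
    print("  ".join(f"{exact - value:+.8f}" for value in row))
```

Row `i` of the tableau uses $2^i$ subintervals, so the printed rows correspond to the rows labeled 1, 2, 4, 8 of Table 9.4 under the stated assumption.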



We finish this section with the corresponding Table 9.5 for the integral

$$\int_0^1 \sqrt{x}\,dx = \frac{2}{3} \tag{9.33}$$

of a function that is not differentiable in all of the integration interval. Not
surprisingly, the convergence is notably slower.
TABLE 9.5. Romberg quadratures for the integral (9.33)

   k    T_k^1       T_k^2       T_k^3       T_k^4       T_k^5
   1    0.166667
   2    0.063113    0.028595
   4    0.023384    0.010140    0.008910
   8    0.008536    0.003587    0.003151    0.003059
  16    0.003085    0.001268    0.001114    0.001082    0.001074
  32    0.001108    0.000448    0.000394    0.000382    0.000380
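
The slow convergence can be reproduced for the trapezoidal column. The sketch below rests on an assumption: it takes the integral (9.33) to be $\int_0^1 \sqrt{x}\,dx = 2/3$, which matches the first entries of the first column of Table 9.5. The helper name `trapezoid` is illustrative.

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n subintervals."""
    h = (b - a) / n
    return h * (0.5 * f(a) + sum(f(a + k * h) for k in range(1, n)) + 0.5 * f(b))

# Assumed form of (9.33): the integrand sqrt(x) is not differentiable at x = 0.
exact = 2.0 / 3.0
errors = [exact - trapezoid(math.sqrt, 0.0, 1.0, n) for n in (1, 2, 4, 8, 16, 32)]
ratios = [errors[i] / errors[i + 1] for i in range(len(errors) - 1)]
print([f"{e:.6f}" for e in errors])
print([f"{r:.3f}" for r in ratios])
```

The error ratios stay well below the factor 4 observed for smooth integrands, so extrapolation based on an expansion in powers of $h^2$ cannot be expected to help much here.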



9.6 Improper Integrals 



We conclude this chapter with an example for the numerical integration of
improper integrals and describe a class of quadrature rules for the integral

$$\int_0^1 f(x)\,dx,$$

where the integrand $f$ is sufficiently smooth in $(0, 1)$ but is allowed to have
singularities at the endpoints $x = 0$ and $x = 1$ such that $f$ is nonetheless
integrable.

Let the function $w : [0, 2\pi] \to [0, 1]$ be bijective, strictly monotonically
increasing, and infinitely differentiable. Then we can substitute $x = w(t)$
and consequently obtain

$$\int_0^1 f(x)\,dx = \int_0^{2\pi} g(t)\,dt,$$

where

$$g(t) := w'(t)\,f(w(t)), \quad 0 < t < 2\pi.$$

Now assume that the function $w$ has derivatives

$$w^{(j)}(0) = w^{(j)}(2\pi) = 0, \quad j = 1, \ldots, p-1, \tag{9.34}$$

and

$$w^{(p)}(0) \neq 0, \quad w^{(p)}(2\pi) \neq 0 \tag{9.35}$$

for some $p \in \mathbb{N}$. Then we may expect that the function $g$ and some of
its derivatives up to a certain order vanish at $t = 0$ and $t = 2\pi$; i.e., $g$
can be considered as a sufficiently smooth $2\pi$-periodic function, and the
rectangular rule may be applied to the transformed integral. This yields
the quadrature formula

$$\int_0^1 f(x)\,dx \approx \sum_{k=1}^{n-1} a_k\,f(x_k) \tag{9.36}$$
with the quadrature points and weights given by

$$x_k = w\left(\frac{2\pi k}{n}\right), \quad a_k = \frac{2\pi}{n}\,w'\left(\frac{2\pi k}{n}\right), \quad k = 1, \ldots, n-1.$$

In addition, it is natural to require the symmetry property

$$w(t) = 1 - w(2\pi - t), \quad t \in [0, 2\pi]. \tag{9.37}$$

Then the quadrature points and weights have the symmetry

$$x_{n-k} = 1 - x_k, \quad a_{n-k} = a_k, \quad k = 1, \ldots, n-1,$$

and from the assumptions (9.34) and (9.35), by Taylor's formula, it follows
that they satisfy the inequalities

$$c_0 \left(\frac{k}{n}\right)^p \le x_k,\ 1 - x_{n-k} \le c_1 \left(\frac{k}{n}\right)^p, \quad k = 1, \ldots, \left[\frac{n}{2}\right], \tag{9.38}$$

$$c_0\,\frac{1}{n} \left(\frac{k}{n}\right)^{p-1} \le a_k,\ a_{n-k} \le c_1\,\frac{1}{n} \left(\frac{k}{n}\right)^{p-1}, \quad k = 1, \ldots, \left[\frac{n}{2}\right], \tag{9.39}$$

for some constants $0 < c_0 < c_1$ depending on the function $w$. From (9.38) it
is obvious that the quadrature points are graded towards the two endpoints
$x = 0$ and $x = 1$ of the integration interval.

For substitutions with the properties (9.34), (9.35), and (9.37), from the
Euler-Maclaurin expansion applied to the integral over $g$ we now will derive
an estimate for the remainder term

$$E_n(f) := \int_0^1 f(x)\,dx - \sum_{k=1}^{n-1} a_k\,f(x_k).$$

For $q \in \mathbb{N}$ and $0 < \alpha < 1$ by $S^{q,\alpha}$ we denote the linear space of $q$-times
continuously differentiable functions $f : (0, 1) \to \mathbb{R}$ for which

$$\sup_{0<x<1} [x(1-x)]^{j+1-\alpha}\,|f^{(j)}(x)| < \infty$$

for $j = 0, \ldots, q$. On $S^{q,\alpha}$ we define the norm

$$\|f\|_{q,\alpha} := \max_{j=0,\ldots,q}\ \sup_{0<x<1} [x(1-x)]^{j+1-\alpha}\,|f^{(j)}(x)|.$$

Then, clearly

$$|f^{(j)}(x)| \le \|f\|_{q,\alpha}\,[x(1-x)]^{\alpha-j-1}, \quad 0 < x < 1, \tag{9.40}$$

for $j = 0, \ldots, q$.
Theorem 9.33 Let $p \in \mathbb{N}$ and assume that $w$ satisfies (9.34), (9.35), and
(9.37). Further, let $q \in \mathbb{N}$ and $f \in S^{2q+1,\alpha}$ with $0 < \alpha < 1$ such that

$$2q + 1 < \alpha p \quad \text{and} \quad 2q + 2 < p.$$

Then the error in the quadrature formula (9.36) can be estimated by

$$|E_n(f)| \le \frac{C}{n^{2q+1}}\,\|f\|_{2q+1,\alpha}$$

with some constant $C$ depending on $w$, $\alpha$, and $q$.

Proof. For the derivatives of $g$ we can write

$$g^{(r)}(t) = \sum_{j=0}^{r} u_j^r(t)\,f^{(j)}(w(t)), \quad r = 0, \ldots, 2q+1.$$

Then from

$$g^{(r+1)}(t) = \sum_{j=0}^{r} \left[ u_j^r(t)\,w'(t)\,f^{(j+1)}(w(t)) + \frac{du_j^r(t)}{dt}\,f^{(j)}(w(t)) \right]$$

we derive the recursion formulae

$$u_j^{r+1}(t) = \begin{cases} \dfrac{du_0^r(t)}{dt}, & j = 0, \\[2ex] u_{j-1}^r(t)\,w'(t) + \dfrac{du_j^r(t)}{dt}, & j = 1, \ldots, r, \\[2ex] u_r^r(t)\,w'(t), & j = r+1, \end{cases} \tag{9.41}$$

for the coefficients $u_j^r$. In particular, we have

$$u_0^r(t) = w^{(r+1)}(t) \quad \text{and} \quad u_r^r(t) = [w'(t)]^{r+1}. \tag{9.42}$$

The functions $u_j^r$ satisfy

$$u_j^r(t) = O\left([t(2\pi - t)]^{z_j}\right), \quad t(2\pi - t) \to 0, \tag{9.43}$$

for $r = 0, \ldots, 2q+1$ and $j = 0, \ldots, r$, where

$$z_j = p - 1 + jp - r.$$

For $j = 0$ and $j = r$ this is obvious from the assumption on $w$ and (9.42),
and for $j = 1, \ldots, r-1$ it follows by induction from the recursion formulae
(9.41). Note that $z_j > 0$ because of the assumption $p > 2q + 2$.

Using (9.40) and the assumptions on $w$, we can estimate

$$|f^{(j)}(w(t))| \le C_1\,\|f\|_{2q+1,\alpha}\,[t(2\pi - t)]^{(\alpha - j - 1)p}$$

for some constant $C_1$, and with the aid of (9.43), we further obtain that

$$|u_j^r(t)\,f^{(j)}(w(t))| \le C_2\,\|f\|_{2q+1,\alpha}\,[t(2\pi - t)]^{\alpha p - r - 1}, \quad 0 < t < 2\pi, \tag{9.44}$$

for some constant $C_2$ and $r = 0, \ldots, 2q+1$ and $j = 0, \ldots, r$. From this,
since $\alpha p > 2q + 1$, we observe that for $r = 0, \ldots, 2q$ the derivatives $g^{(r)}$ can
be continuously extended from $(0, 2\pi)$ onto $[0, 2\pi]$ with values

$$g^{(r)}(0) = g^{(r)}(2\pi) = 0, \quad r = 0, \ldots, 2q.$$

Furthermore, from (9.44) and the assumption $\alpha p > 2q + 1$ we see that the
integral of $g^{(2q+1)}$ over $[0, 2\pi]$ exists as an improper integral and

$$\int_0^{2\pi} |g^{(2q+1)}(t)|\,dt \le C_3\,\|f\|_{2q+1,\alpha}$$

with some constant $C_3$ depending on $w$, $\alpha$, and $q$. Now the statement follows
from Corollary 9.27 of the Euler-Maclaurin expansion. Note that for
the Euler-Maclaurin expansion (9.27) to be valid it obviously suffices that
the integral of the error term exists as an improper integral. □



We proceed by describing a few examples for substitutions $w$ (see Problem
9.19). In 1963 Korobov suggested the polynomial transformation

$$w_p(t) := \int_0^t [s(2\pi - s)]^{p-1}\,ds \left[ \int_0^{2\pi} [s(2\pi - s)]^{p-1}\,ds \right]^{-1}. \tag{9.45}$$

The trigonometric transformation

$$w_p(t) := \int_0^t \sin^{p-1}\frac{s}{2}\,ds \left[ \int_0^{2\pi} \sin^{p-1}\frac{s}{2}\,ds \right]^{-1} \tag{9.46}$$

with the special cases

$$w_1(t) = \frac{t}{2\pi}, \quad w_2(t) = \frac{1}{2} \left( 1 - \cos\frac{t}{2} \right), \quad w_3(t) = \frac{1}{2\pi}\,(t - \sin t)$$

was proposed by Sidi [54]. Substitutions of the form

$$w_p(t) := \frac{t^p}{t^p + (2\pi - t)^p} \tag{9.47}$$

were considered in [40]. As a rule of thumb, these substitutions should not
be used for $p$ too large, say $p > 10$, because this may lead to overgrading
and numerical difficulties with underflow. The substitutions
and



w(t) = 




27 r 
2t r- 






with zeros of infinite order at the endpoints, which were suggested by Iri, 
Moriguti, and Takesawa [58] and by Sag and Szekeres [52], respectively, 
also suffer from this drawback. 

As a numerical example we consider the improper integral

$$\int_0^1 \frac{dx}{\sqrt{x}} = 2. \tag{9.48}$$

Table 9.6 gives the error between the exact value and the numerical
approximation obtained by using the substitution (9.47).



TABLE 9.6. Numerical quadrature for the integral (9.48)

    n    p = 3         p = 4          p = 5          p = 6
    8    0.07012542    -0.06064201    -0.22007377    -0.42795942
   16    0.02849925     0.00455233    -0.00438402    -0.01896018
   32    0.00992273     0.00129852     0.00011279    -0.00003394
   64    0.00347755     0.00032530     0.00002117    -0.00000019
  128    0.00122386     0.00008137     0.00000382    -0.00000001
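
A quadrature of the form (9.36) with the substitution (9.47) can be sketched as follows. This is an illustration under assumptions: the function name `transformed_rule` is hypothetical, the derivative $w_p'(t) = 2\pi p\,[t(2\pi - t)]^{p-1} / [t^p + (2\pi - t)^p]^2$ was worked out here from (9.47), and the counting of $n$ follows (9.36) literally, which need not coincide with the convention used for Table 9.6.

```python
import math

def transformed_rule(f, n, p):
    """Rule (9.36) with substitution (9.47): sum of a_k * f(x_k), k = 1..n-1."""
    total = 0.0
    for k in range(1, n):
        t = 2.0 * math.pi * k / n
        u = 2.0 * math.pi - t
        denom = t**p + u**p
        x = t**p / denom                                         # x_k = w_p(t_k)
        dw = 2.0 * math.pi * p * (t * u) ** (p - 1) / denom**2   # w_p'(t_k)
        total += 2.0 * math.pi / n * dw * f(x)                   # a_k * f(x_k)
    return total

# Integrand of (9.48) with an endpoint singularity at x = 0; exact value 2.
f = lambda x: 1.0 / math.sqrt(x)
e8 = 2.0 - transformed_rule(f, 8, 3)
e64 = 2.0 - transformed_rule(f, 64, 3)
print(e8, e64)
```

The endpoints $k = 0$ and $k = n$ are skipped because the weights vanish there, so the singular integrand is never evaluated at $x = 0$.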



Problems 



9.1 Show that the error for the composite trapezoidal rule can be expressed in
the form

$$\int_a^b f(x)\,dx - T_h(f) = -\int_a^b K_T(x)\,f''(x)\,dx,$$

where the so-called Peano kernel $K_T$ is given by

$$K_T(x) := \frac{1}{2}\,(x - x_{k-1})(x_k - x), \quad x_{k-1} \le x \le x_k,$$

for $k = 1, \ldots, n$. Use this error representation for an alternative proof of Theorem
9.7.



9.2 Show that the error for the composite Simpson's rule can be expressed in
the form

$$\int_a^b f(x)\,dx - S_h(f) = -\int_a^b K_S(x)\,f^{(4)}(x)\,dx,$$

where the Peano kernel $K_S$ is given by

$$K_S(x) := \begin{cases} \dfrac{h}{18}\,(x - x_{k-2})^3 - \dfrac{1}{24}\,(x - x_{k-2})^4, & x_{k-2} \le x \le x_{k-1}, \\[2ex] \dfrac{h}{18}\,(x_k - x)^3 - \dfrac{1}{24}\,(x_k - x)^4, & x_{k-1} \le x \le x_k, \end{cases}$$

for $k = 2, 4, \ldots, n$. Use this error representation for an alternative proof of
Theorem 9.8.



9.3 For Newton's three-eighths rule prove the error representation

$$\int_a^b f(x)\,dx - \frac{b-a}{8} \left[ f(a) + 3 f(a+h) + 3 f(b-h) + f(b) \right] = -\frac{3 h^5}{80}\,f^{(4)}(\xi)$$

with some $\xi$ in $[a, b]$ and $h = (b-a)/3$.

9.4 Show that the weight $a_1$ for the Newton-Cotes formula of order eight is
negative.

9.5 For the remainder E n of the Newton-Cotes formula of order n on the in- 
terval [—1,1], applied to the Chebyshev polynomial T n + 1 , show that 

From this conclude that if n odd, then 

ll-E'nlloo > |£7„(T„ + i)| > 7n, 



(n — 1)! 4 n+1 

771 = 3n n+ i > °°» n ^°°- 

Hint: Use Theorem 8.10 and show that 






for n odd. 



9.6 Compute the weights for the polynomial interpolatory quadratures with
equidistant quadrature points

$$x_k = a + (k+1)\,\frac{b-a}{n+2}, \quad k = 0, 1, \ldots, n,$$

for $n = 0, 1, 2$ and obtain representations of the quadrature errors. These formulae
are called open Newton-Cotes quadratures, since the two endpoints $a$ and $b$ are
omitted.
9.7 For $n \in \mathbb{N}$, a quadrature formula of the form

$$\int_a^b f(x)\,dx \approx \frac{b-a}{n} \sum_{k=1}^{n} f(x_k)$$

with distinct quadrature points $x_1, \ldots, x_n \in [a, b]$ and equal weights is called a
Chebyshev quadrature if it integrates polynomials in $P_n$ exactly. Find the Chebyshev
quadratures for $n = 1, 2, 3, 4$. (Chebyshev quadratures exist only for $n < 8$.)

9.8 Show that there exists no polynomial interpolatory quadrature of order n 
that integrates polynomials of degree 2n + 2 exactly. 

9.9 The Chebyshev polynomial of the second kind $U_n$ of degree $n$ is defined by

$$U_n(x) := \frac{\sin((n+1) \arccos x)}{\sin(\arccos x)}, \quad -1 \le x \le 1.$$

Show that $U_0(x) = 1$, $U_1(x) = 2x$, and

$$U_{n+1}(x) + U_{n-1}(x) = 2x\,U_n(x), \quad n = 1, 2, \ldots.$$

Prove the orthogonality relation

$$\int_{-1}^{1} \sqrt{1 - x^2}\;U_n(x)\,U_m(x)\,dx = \frac{\pi}{2}\,\delta_{nm}.$$



9.10 Show that the quadrature points and quadrature weights for the Gauss-Chebyshev
quadrature of order $n - 1$ for the integral

$$\int_{-1}^{1} \sqrt{1 - x^2}\;f(x)\,dx$$

are given by

$$x_k = \cos\frac{k+1}{n+1}\,\pi$$

and

$$a_k = \frac{\pi}{n+1}\,\sin^2\frac{k+1}{n+1}\,\pi$$

for $k = 0, \ldots, n-1$.



9.11 Find the quadrature weights $a_0, a_1, a_2, a_3$, and the (remaining) quadrature
points $x_1, x_2$ of a quadrature formula of the form

$$\int_{-1}^{1} f(x)\,dx \approx a_0 f(-1) + a_1 f(x_1) + a_2 f(x_2) + a_3 f(1)$$

that is exact for all polynomials in $P_5$. (This is an example of a Gauss-Lobatto
quadrature, i.e., a Gauss quadrature with two preassigned quadrature points.)

Find the quadrature weights $a_0, a_1, a_2$, and the (remaining) quadrature points
$x_1, x_2$ of a quadrature formula of the form

$$\int_{-1}^{1} f(x)\,dx \approx a_0 f(-1) + a_1 f(x_1) + a_2 f(x_2)$$

that is exact for all polynomials in $P_4$. (This is an example of a Gauss-Radau
quadrature, i.e., a Gauss quadrature with one preassigned quadrature point.)
9.12 By approximating the integral

$$\int_0^{2\pi} \frac{1}{5 - 4\cos x}\,dx = \frac{2\pi}{3}$$

by the rectangular rule and Simpson's rule convince yourself of the superiority of
the rectangular rule for periodic functions.



9.13 Verify the Fourier series (9.25) and (9.26) for the periodic Bernoulli poly- 
nomials. 



9.14 For the Bernoulli polynomials show that the series

$$\sum_{n=0}^{\infty} B_n(x)\,t^n = \frac{t\,e^{xt}}{e^t - 1}$$

is absolutely and locally uniformly convergent for all $x \in [0, 1]$ and all $t \in (-1, 1)$.



9.15 Derive a quadrature formula by integrating the interpolating cubic spline 
from Theorem 8.30 and discuss its relation to the Euler-Maclaurin expansion. 



9.16 Write a computer program for Romberg integration and test it for various 
examples. 

9.17 Calculate the weights of the Romberg quadratures T% and T^. 

9.18 Show that the Richardson extrapolation for the midpoint rule (9.20) leads 
to nonnegative quadrature weights. 



9.19 Show that the functions (9.45), (9.46), and (9.47) are strictly monotonically
increasing, infinitely differentiable, and map $[0, 2\pi]$ onto $[0, 1]$ such that
(9.34), (9.35), and (9.37) are satisfied.

9.20 Write a computer program for the numerical quadrature (9.36) using the 
substitution (9.47) and test it for various examples. 




10 

Initial Value Problems 



Historically, the study of differential equations originated in the beginnings 
of calculus with Newton and Leibniz in the seventeenth century and is 
closely interwoven with the general development of mathematics. To a sub- 
stantial degree, the central role of differential equations within mathematics 
is due to the fact that many important problems in science and engineering 
are modeled by differential equations. 

This chapter will be devoted to an introduction to the basic numerical 
approximation methods for initial value problems for ordinary differential 
equations. For a more comprehensive study we refer to [13, 33, 42, 46, 55]. 
Analogous to the need for numerical quadrature formulas, numerical meth- 
ods for the approximate solution of ordinary differential equations are nec- 
essary, since in general, no explicit solutions of the differential equation 
will be known, despite the fact that there exists a broad range of analyt- 
ical solution methods for special classes of ordinary differential equations. 
In addition, the functions and data involved in the differential equation 
problem quite often will be available only at discrete points. However, we 
would like to emphasize that despite the availability of numerical methods 
the study of elementary analytical methods for the solution of ordinary 
differential equations remains worthwhile, since it provides a first step into 
gaining insight into the general structure of differential equations. 

A solid foundation for numerical approximation methods for differen- 
tial equations, including their convergence and error analysis, requires as 
a prerequisite results on the existence and uniqueness of the solution to 
the problem to be approximately solved. Therefore, in Section 10.1 we will
begin with proving the fundamental Picard-Lindelöf existence and uniqueness
theorem for initial value problems. In Section 10.2 we will describe
some variants of the simplest method for the numerical solution of initial 
value problems, which was first used by Euler. These methods are special 
cases of so-called single-step methods, for which we will give a convergence 
and error analysis in Section 10.3. This section also includes a short dis- 
cussion of the Runge-Kutta method as the most widely used single-step 
method. The final section, Section 10.4, is concerned with the description 
and analysis of multistep methods. 

We wish to note explicitly that this chapter is also meant to serve as 
an application of some of the material provided in Chapters 8 and 9 on 
interpolation and numerical integration. 



10.1 The Picard-Lindelöf Theorem

Definition 10.1 Let $G \subset \mathbb{R}^2$ be a domain and $f : G \to \mathbb{R}$. A continuously
differentiable function $u : [a, b] \to \mathbb{R}$ is called a solution of the ordinary
differential equation of the first order

$$u' = f(x, u) \tag{10.1}$$

if $(x, u(x)) \in G$ and $u'(x) = f(x, u(x))$ for all $x \in [a, b]$.

Geometrically speaking, the differential equation (10.1) defines a field of 
directions on G. Solving the differential equation means looking for func- 
tions whose graphs match this field of directions. 

Systems of ordinary differential equations can be included in the discussion
as follows. If $G \subset \mathbb{R}^{n+1}$ is a domain and $f : G \to \mathbb{R}^n$, then a
continuously differentiable function $u : [a,b] \to \mathbb{R}^n$ is called a solution of
the system of ordinary differential equations of the first order

$$u' = f(x,u)$$

if $(x,u(x)) \in G$ and $u'(x) = f(x,u(x))$ for all $x \in [a,b]$. More explicitly,
this system reads

$$\begin{aligned}
u_1' &= f_1(x,u_1,\dots,u_n), \\
u_2' &= f_2(x,u_1,\dots,u_n), \\
&\;\;\vdots \\
u_n' &= f_n(x,u_1,\dots,u_n)
\end{aligned}$$

for $u = (u_1,\dots,u_n)^T$ and $f = (f_1,\dots,f_n)^T$. Each ordinary differential
equation

$$u^{(n)} = f(x,u,u',\dots,u^{(n-1)})$$

of order $n$ is equivalent to the system

$$u_1' = u_2, \quad u_2' = u_3, \quad \dots, \quad u_{n-1}' = u_n, \quad u_n' = f(x,u_1,\dots,u_n)$$

via $u_1 = u$, $u_2 = u'$, ..., $u_n = u^{(n-1)}$. Therefore, in principle, considering
only differential equations of the first order is no loss of generality.
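
The reduction of an equation of order $n$ to a first-order system can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `to_first_order_system` and the representation of $(u_1,\dots,u_n)$ as a Python list are hypothetical choices and not part of the text.

```python
def to_first_order_system(f, n):
    """Return F such that U' = F(x, U) is equivalent to
    u^(n) = f(x, u, u', ..., u^(n-1)).

    Assumption (illustrative): U is a list [u_1, ..., u_n] with
    u_1 = u, u_2 = u', ..., u_n = u^(n-1).
    """
    def F(x, U):
        if len(U) != n:
            raise ValueError("U must have exactly n components")
        # u_1' = u_2, ..., u_{n-1}' = u_n, and u_n' = f(x, u_1, ..., u_n)
        return list(U[1:]) + [f(x, *U)]
    return F


# Example: u'' = -u becomes u_1' = u_2, u_2' = -u_1.
oscillator = to_first_order_system(lambda x, u, du: -u, 2)
third_order = to_first_order_system(lambda x, u, du, ddu: x + u + du + ddu, 3)
```

Any solver written for first-order systems can then be applied to `oscillator` without knowing that it came from a second-order equation.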

From the wide field of applications we sketch only the following two 
simple examples. 

Example 10.2 By Newton’s law, the differential equation of the second
order

$$m u'' = f(t,u)$$

describes the motion of an object of mass $m$ subject to the external force
$f(t,u)$ depending on the location $u$ of the object and the time $t$. Given an
initial location $u_0$ and an initial velocity $u_0'$ at the initial time $t = 0$, one
wants to find the position $u(t)$ of the object for all times $t > 0$. □

Example 10.3 Let $p = p(t)$ describe the population of a species of animals
or plants at time $t$. If $r(t,p)$ denotes the growth rate given by the difference
between the birth and death rate depending on the time $t$ and the size
$p$ of the population, then an isolated population satisfies the differential
equation

$$\frac{dp}{dt} = r(t,p).$$

The simplest model $r(t,p) = ap$, where $a$ is a positive constant, leads to

$$\frac{dp}{dt} = ap$$

with the explicit solution $p(t) = p_0 e^{a(t-t_0)}$. Such an exponential growth is
realistic only if the population is not too large. The modified model

$$\frac{dp}{dt} = ap - bp^2$$

with positive constants $a$ and $b$ contains a correction term that slows down
the growth rate for large populations and is known as the Verhulst equation.
It was introduced by Verhulst in 1838 as a model for the growth of
the human population. In general, for a given growth rate $r$ one wants to
determine the development of the population $p(t)$ in time for a given initial
population $p_0$ at time $t = t_0$. □

Both examples are typical initial value problems: Find a solution of a 
differential equation that attains a given initial value at a given initial 
time. This notion is made more precise by the following definition.

Definition 10.4 The initial value problem for the ordinary differential
equation

$$u' = f(x,u) \tag{10.2}$$

consists in finding a continuously differentiable solution $u$ satisfying the
initial condition

$$u(x_0) = u_0 \tag{10.3}$$

for a given initial point $x_0$ and a given initial value $u_0$.

The existence and uniqueness of a solution to such an initial value prob- 
lem are settled through the following fundamental theorem. 

Theorem 10.5 (Picard-Lindelöf) Let $G \subset \mathbb{R}^{n+1}$ be a domain and let
$f : G \to \mathbb{R}^n$ be a continuous function satisfying a Lipschitz condition

$$\|f(x,u) - f(x,v)\| \le L\|u - v\| \tag{10.4}$$

for all $(x,u), (x,v) \in G$ and some constant $L > 0$, which is called the
Lipschitz constant. Then for each initial data pair $(x_0,u_0) \in G$ there exists
an interval $[x_0 - a, x_0 + a]$ with $a > 0$ such that the initial value problem
(10.2)-(10.3) has a unique solution in this interval.

Proof. Firstly, we transform the initial value problem equivalently into the
Volterra integral equation

$$u(x) = u_0 + \int_{x_0}^{x} f(\xi,u(\xi))\,d\xi. \tag{10.5}$$

Clearly, if $u$ solves the initial value problem, then it follows by integrating
the differential equation and using the initial condition that $u$ also solves
the integral equation. Conversely, if $u$ is a continuous solution of the integral
equation, then by differentiating the integral equation it follows that $u$ is
continuously differentiable and satisfies the differential equation. Inserting
$x = x_0$ in (10.5) shows that the initial condition is fulfilled.

For solving the Volterra integral equation we now can employ Banach’s
fixed point Theorem 3.45. Since $G$ is open, we can choose a bounded domain
$D$ such that $(x_0,u_0) \in D$ and $\overline{D} \subset G$. Denote by $M$ a bound on the
continuous function $f : \overline{D} \to \mathbb{R}^n$; i.e.,

$$\|f(x,u)\| \le M, \quad (x,u) \in D.$$

Since $D$ is open, we can choose $a > 0$ such that the closed rectangle

$$B := \{(x,u) \in \mathbb{R}^{n+1} : |x - x_0| \le a,\ \|u - u_0\| \le Ma\}$$

is contained in $D$. Consider the Banach space $C[x_0 - a, x_0 + a]$ of continuous
functions $u : [x_0 - a, x_0 + a] \to \mathbb{R}^n$ furnished with the maximum norm

$$\|u\|_\infty := \max_{|x - x_0| \le a} \|u(x)\|$$

in terms of the chosen norm $\|\cdot\|$ on $\mathbb{R}^n$. Each solution $u$ of the integral
equation satisfies (see (6.1))

$$\|u(x) - u_0\| = \left\| \int_{x_0}^{x} f(\xi,u(\xi))\,d\xi \right\| \le Ma, \quad |x - x_0| \le a,$$

that is,

$$\|u - u_0\|_\infty \le Ma,$$

which implies that the solution remains within the rectangle $B$. We consider
the closed subset

$$U := \{u \in C[x_0 - a, x_0 + a] : \|u - u_0\|_\infty \le Ma\}$$

of the Banach space $C[x_0 - a, x_0 + a]$ and note that by Remark 3.40 the
set $U$ is complete. On $U$ we define an operator $A : U \to U$ by setting

$$(Au)(x) := u_0 + \int_{x_0}^{x} f(\xi,u(\xi))\,d\xi, \quad |x - x_0| \le a.$$

The operator $A$ indeed maps $U$ into itself, since the function $Au$ is continuous
and satisfies $\|Au - u_0\|_\infty \le Ma$. With the aid of the Lipschitz
condition (10.4) and using (6.1) we can estimate

$$\|(Au)(x) - (Av)(x)\| = \left\| \int_{x_0}^{x} [f(\xi,u(\xi)) - f(\xi,v(\xi))]\,d\xi \right\|
\le L \left| \int_{x_0}^{x} \|u(\xi) - v(\xi)\|\,d\xi \right| \le La\|u - v\|_\infty$$

for all $|x - x_0| \le a$. Hence

$$\|Au - Av\|_\infty \le La\|u - v\|_\infty$$

for all $u,v \in U$. Now we choose $a$ such that $a < 1/L$. Then $A : U \to U$ is a
contraction operator, and the Banach fixed point theorem ensures a unique
fixed point of $A$, i.e., a unique solution of the integral equation (10.5) in
the interval $[x_0 - a, x_0 + a]$. □

Exploiting the fact that in Theorem 10.5 the width $a$ of the interval
is determined by the Lipschitz constant $L$, which is independent of the
initial point $(x_0,u_0)$, one can assure global existence of the solution; i.e.,
the solution to the initial value problem exists and is unique until it leaves
the domain $G$ of definition for the differential equation.

Note that on a convex domain each function that is continuously dif- 
ferentiable with respect to u satisfies a Lipschitz condition (see the mean 
value Theorem 6.7).

Corollary 10.6 Under the assumptions of Theorem 10.5, the sequence
$(u_\nu)$ defined by $u_0(x) = u_0$ and

$$u_{\nu+1}(x) := u_0 + \int_{x_0}^{x} f(\xi,u_\nu(\xi))\,d\xi, \quad |x - x_0| \le a, \quad \nu = 0,1,\dots, \tag{10.6}$$

converges as $\nu \to \infty$ uniformly on $[x_0 - a, x_0 + a]$ to the unique solution $u$
of the initial value problem. We have the a posteriori error estimate

$$\|u - u_\nu\|_\infty \le \frac{La}{1 - La}\,\|u_\nu - u_{\nu-1}\|_\infty, \quad \nu = 1,2,\dots.$$

Proof. This follows from Theorem 3.46. □

Example 10.7 Consider the initial value problem

$$u' = x^2 + u^2, \quad u(0) = 0,$$

on $G = (-0.5, 0.5) \times (-0.5, 0.5)$. For $f(x,u) := x^2 + u^2$ we have

$$|f(x,u)| \le 0.5$$

on $G$. Hence for any $a < 0.5$ and $M = 0.5$ the rectangle $B$ from the proof
of Theorem 10.5 satisfies $B \subset G$. Furthermore, we can estimate

$$|f(x,u) - f(x,v)| = |u^2 - v^2| = |(u + v)(u - v)| \le |u - v|$$

for all $(x,u), (x,v) \in G$; i.e., $f$ satisfies a Lipschitz condition with Lipschitz
constant $L = 1$. Thus in this case the contraction number in the
Picard-Lindelöf theorem is given by $La < 0.5$.

Here, the iteration (10.6) reads

$$u_{\nu+1}(x) = \int_0^x [\xi^2 + u_\nu^2(\xi)]\,d\xi.$$

Starting with $u_0(x) = 0$ we first compute

$$u_1(x) = \int_0^x \xi^2\,d\xi = \frac{x^3}{3},$$

and from Corollary 10.6 we have the error estimate

$$\|u - u_1\|_\infty \le \|u_1 - u_0\|_\infty = \frac{1}{24} = 0.041\ldots.$$

The second iteration yields

$$u_2(x) = \int_0^x \left[\xi^2 + \frac{\xi^6}{9}\right] d\xi = \frac{x^3}{3} + \frac{x^7}{63}$$

with the error estimate

$$\|u - u_2\|_\infty \le \|u_2 - u_1\|_\infty = \frac{1}{63 \cdot 2^7} = 0.00012\ldots,$$

and the third iteration yields

$$u_3(x) = \int_0^x \left[\xi^2 + \frac{\xi^6}{9} + \frac{2\xi^{10}}{189} + \frac{\xi^{14}}{3969}\right] d\xi
= \frac{x^3}{3} + \frac{x^7}{63} + \frac{2x^{11}}{2079} + \frac{x^{15}}{59535}$$

with the error estimate

$$\|u - u_3\|_\infty \le \|u_3 - u_2\|_\infty = \frac{2}{2079 \cdot 2^{11}} + \frac{1}{59535 \cdot 2^{15}} = 0.00000047\ldots.$$

In this example three steps of the Picard-Lindelöf iteration give eight decimal
places of accuracy. However, the example is not typical, since in general,
the integrations required in each iteration step will not be available
explicitly as in the present case. □
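
The Picard iterates of Example 10.7 are polynomials, so they can be reproduced exactly with rational arithmetic. The following Python sketch is an illustration only: the coefficient-list representation (index $k$ holds the coefficient of $x^k$) and all helper names are assumptions made for this example.

```python
from fractions import Fraction


def poly_mul(p, q):
    """Product of two polynomials given as coefficient lists."""
    r = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i + j] += a * b
    return r


def poly_add(p, q):
    """Sum of two polynomials, padding the shorter list with zeros."""
    n = max(len(p), len(q))
    p = p + [Fraction(0)] * (n - len(p))
    q = q + [Fraction(0)] * (n - len(q))
    return [a + b for a, b in zip(p, q)]


def poly_integrate(p):
    """Integral from 0 to x: the coefficient of x^(k+1) is p[k]/(k+1)."""
    return [Fraction(0)] + [c / (k + 1) for k, c in enumerate(p)]


def poly_eval(p, x):
    return sum(c * x**k for k, c in enumerate(p))


def picard_step(u):
    """One step of (10.6) for f(x, u) = x^2 + u^2 with x_0 = 0, u_0 = 0."""
    x_squared = [Fraction(0), Fraction(0), Fraction(1)]
    return poly_integrate(poly_add(x_squared, poly_mul(u, u)))


def picard_iterates(count):
    """Return [u_1, ..., u_count], starting from u_0(x) = 0."""
    u, out = [Fraction(0)], []
    for _ in range(count):
        u = picard_step(u)
        out.append(u)
    return out


u1, u2, u3 = picard_iterates(3)
half = Fraction(1, 2)
# ||u_3 - u_2|| on |x| <= 0.5 is attained at x = 0.5 (odd powers, positive coefficients).
bound3 = poly_eval(u3, half) - poly_eval(u2, half)
```

Evaluating the differences of consecutive iterates at $x = 0.5$ gives the right-hand sides of the error estimates in the example, since $La/(1 - La) < 1$ here.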



10.2 Euler’s Method 

In the sequel we confine our presentation to the initial value problem for a
differential equation of the first order. The generalization to systems and
hence to equations of higher order is straightforward. We shall always
tacitly assume that the assumptions of the Picard-Lindelöf Theorem 10.5
are satisfied.

The following simple method for the numerical solution of the initial
value problem

$$u' = f(x,u), \quad u(x_0) = u_0, \tag{10.7}$$

was first used by Euler. Given a step size $h > 0$, it consists in replacing the
derivative $u' = f(x,u)$ throughout the interval $[x_0, x_0 + h]$ by the derivative
$u_0' = f(x_0,u_0)$ at the initial point, i.e., geometrically speaking, by replacing
the solution by its tangent line at the initial point $x_0$. This leads to the
approximation

$$u_1 = u_0 + h f(x_0,u_0) \tag{10.8}$$

for the value $u(x_1)$ of the exact solution at the point $x_1 = x_0 + h$. Repeating
this procedure leads to the Euler method as described in the following
definition. For obvious reasons, this method is also known as the polygon
method, since it approximates the exact solution curve by a polygon.

Definition 10.8 The Euler method for the numerical solution of the initial
value problem (10.7) constructs approximations $u_j$ to the exact solution
$u(x_j)$ at the equidistant grid points

$$x_j := x_0 + jh, \quad j = 1,2,\dots,$$

with step size $h$ by

$$u_{j+1} := u_j + h f(x_j,u_j), \quad j = 0,1,\dots.$$

Example 10.9 Consider the initial value problem

$$u' = x^2 + u^2, \quad u(0) = 0,$$

from Example 10.7. Table 10.1 gives the difference between the exact solution
as computed by the Picard-Lindelöf iterations in Example 10.7 and
the approximate solution obtained by Euler’s method for various step sizes
$h$. We observe a linear convergence as $h \to 0$. □



TABLE 10.1. Numerical example for the Euler method

    x      h = 0.1     h = 0.01    h = 0.001   h = 0.0001
   0.1    0.000333    0.000048    0.000005    0.000000
   0.2    0.001667    0.000197    0.000020    0.000002
   0.3    0.004003    0.000446    0.000045    0.000005
   0.4    0.007357    0.000798    0.000080    0.000008
   0.5    0.011769    0.001258    0.000127    0.000013
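
The Euler method of Definition 10.8 and the experiment of Example 10.9 can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function names are hypothetical, and the third Picard iterate $u_3$ from Example 10.7 is assumed to be an adequate stand-in for the exact solution.

```python
def euler(f, x0, u0, h, n):
    """Return [u_0, u_1, ..., u_n] with u_{j+1} = u_j + h*f(x_j, u_j)."""
    us = [u0]
    for j in range(n):
        xj = x0 + j * h  # grid point x_j = x_0 + j*h
        us.append(us[-1] + h * f(xj, us[-1]))
    return us


def f(x, u):
    """Right-hand side of Example 10.7."""
    return x**2 + u**2


def reference(x):
    # Assumption: the Picard iterate u_3 of Example 10.7 replaces the exact solution.
    return x**3 / 3 + x**7 / 63 + 2 * x**11 / 2079 + x**15 / 59535


def error_at_half(h):
    """Exact minus approximate value at x = 0.5 for step size h."""
    n = round(0.5 / h)
    return reference(0.5) - euler(f, 0.0, 0.0, h, n)[-1]


if __name__ == "__main__":
    for h in (0.1, 0.01, 0.001):
        print(h, error_at_half(h))
```

Reducing the step size by a factor of ten reduces the error at $x = 0.5$ by roughly the same factor, which is the linear convergence observed in the table.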



There are three different interpretations of the approximation formula of
Euler’s method:

1. Replace the derivative by the difference quotient

$$\frac{u(x_1) - u(x_0)}{h} \approx u'(x_0) = f(x_0,u_0)$$

and solve for $u(x_1)$.

2. Integrate in the equivalent integral equation (10.5), i.e., in

$$u(x_1) = u(x_0) + \int_{x_0}^{x_1} f(\xi,u(\xi))\,d\xi,$$

approximately by the rectangular rule

$$\int_{x_0}^{x_1} f(\xi,u(\xi))\,d\xi \approx h f(x_0,u_0).$$

3. Use Taylor’s formula

$$u(x_1) = u(x_0) + h u'(x_0) + \frac{h^2}{2}\,u''(x_0 + \theta h)$$

with $0 < \theta < 1$ and neglect the remainder term; i.e., approximate

$$u(x_1) \approx u(x_0) + h u'(x_0).$$

Each of these three interpretations opens up possibilities for improvements
of Euler’s method. For example, instead of the rectangular rule we
can use the more accurate trapezoidal rule

$$\int_{x_0}^{x_1} f(\xi,u(\xi))\,d\xi \approx \frac{h}{2}\,[f(x_0,u(x_0)) + f(x_1,u(x_1))],$$

which yields

$$u_1 = u_0 + \frac{h}{2}\,[f(x_0,u_0) + f(x_1,u_1)]. \tag{10.9}$$

Repeating this procedure leads to the following method.

Definition 10.10 The implicit Euler method for the numerical solution of
the initial value problem (10.7) constructs approximations $u_j$ to the exact
solution $u(x_j)$ at the equidistant grid points

$$x_j := x_0 + jh, \quad j = 1,2,\dots,$$

with step size $h$ by

$$u_{j+1} = u_j + \frac{h}{2}\,[f(x_j,u_j) + f(x_{j+1},u_{j+1})], \quad j = 0,1,\dots.$$

This method is called an implicit method, since determining $u_{j+1}$ requires
the solution of an equation that in general is nonlinear. In contrast, the
Euler method of Definition 10.8 is an explicit method, since it provides an
explicit expression for the computation of $u_{j+1}$.

Remark 10.11 The nonlinear equations of the implicit Euler method can
be solved by successive approximations, provided that the Lipschitz constant
$L$ for $f$ and the step size $h$ satisfy $Lh < 2$.

Proof. We have to solve equation (10.9) for $u_1$. Setting

$$g(u) := u_0 + \frac{h}{2}\,[f(x_0,u_0) + f(x_1,u)]$$

we can rewrite (10.9) as the fixed point equation $u_1 = g(u_1)$. The function
$g$ is a contraction, since

$$|g(u) - g(v)| = \frac{h}{2}\,|f(x_1,u) - f(x_1,v)| \le \frac{Lh}{2}\,|u - v|,$$

and therefore the assertion follows from Theorem 3.46. □
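
The successive approximations of Remark 10.11 can be sketched for a single step of (10.9). The code below is an illustration under assumptions: the function name, the stopping tolerance `tol`, and the iteration cap `max_iter` are hypothetical choices, and the explicit Euler value is used as the starting guess.

```python
def implicit_step(f, x0, u0, h, tol=1e-12, max_iter=100):
    """Solve u_1 = g(u_1), g(u) = u_0 + h/2*[f(x_0,u_0) + f(x_1,u)], by fixed point iteration.

    The iteration contracts with factor L*h/2, so it assumes L*h < 2.
    """
    x1 = x0 + h
    f0 = f(x0, u0)
    u = u0 + h * f0  # starting guess: explicit Euler step (10.8)
    for _ in range(max_iter):
        u_new = u0 + 0.5 * h * (f0 + f(x1, u))
        if abs(u_new - u) <= tol:
            return u_new
        u = u_new
    raise RuntimeError("fixed point iteration did not converge; is L*h < 2?")


# For u' = -u the equation (10.9) is linear: u_1 = u_0*(1 - h/2)/(1 + h/2).
linear_step = implicit_step(lambda x, u: -u, 0.0, 1.0, 0.1)
```

For $u' = -u$ with $h = 0.1$ the contraction factor is $Lh/2 = 0.05$, so only a handful of iterations are needed.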

Since the solution of the nonlinear equation (10.9) will deliver only an 
approximation to the solution of the initial value problem, there is no need 
to solve (10.9) with high accuracy. Using the approximate value from the 
explicit Euler method as a starting point and carrying out only one iteration,
we arrive at the following method.

Definition 10.12 The predictor corrector method for the Euler method
for the numerical solution of the initial value problem (10.7), also known
as the improved Euler method or Heun method, constructs approximations
$u_j$ to the exact solution $u(x_j)$ at the equidistant grid points

$$x_j := x_0 + jh, \quad j = 1,2,\dots,$$

by

$$u_{j+1} := u_j + \frac{h}{2}\,[f(x_j,u_j) + f(x_{j+1}, u_j + h f(x_j,u_j))], \quad j = 0,1,\dots.$$

Example 10.13 Consider again the initial value problem from Example
10.7. Table 10.2 gives the difference between the exact solution as computed
by the Picard-Lindelöf iterations and the approximate solution obtained by
the improved Euler method for various step sizes $h$. We observe quadratic
convergence as $h \to 0$. □

TABLE 10.2. Numerical example for the improved Euler method

    x       h = 0.1        h = 0.01       h = 0.001
   0.1    -0.00016667    -0.00000167    -0.00000002
   0.2    -0.00033326    -0.00000333    -0.00000003
   0.3    -0.00049955    -0.00000500    -0.00000005
   0.4    -0.00066530    -0.00000668    -0.00000007
   0.5    -0.00083027    -0.00000837    -0.00000009
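
The improved Euler method of Definition 10.12 and the experiment of Example 10.13 can be sketched in the same way. As before, this is only an illustration: the names are hypothetical, and the Picard iterate $u_3$ from Example 10.7 is assumed to be accurate enough to play the role of the exact solution.

```python
def improved_euler(f, x0, u0, h, n):
    """Return [u_0, ..., u_n] computed by the improved Euler (Heun) method."""
    us = [u0]
    for j in range(n):
        xj = x0 + j * h
        uj = us[-1]
        slope = f(xj, uj)
        predictor = uj + h * slope  # explicit Euler step as predictor
        # one corrector iteration of the implicit formula (10.9)
        us.append(uj + 0.5 * h * (slope + f(xj + h, predictor)))
    return us


def f(x, u):
    """Right-hand side of Example 10.7."""
    return x**2 + u**2


def reference(x):
    # Assumption: the Picard iterate u_3 of Example 10.7 replaces the exact solution.
    return x**3 / 3 + x**7 / 63 + 2 * x**11 / 2079 + x**15 / 59535


def difference(x, h):
    """Exact minus approximate value at the grid point x for step size h."""
    n = round(x / h)
    return reference(x) - improved_euler(f, 0.0, 0.0, h, n)[-1]
```

Dividing the step size by ten should reduce `difference(0.5, h)` by a factor of about one hundred, in line with the quadratic convergence seen in Table 10.2.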



In the following section we will show that the Euler method and the 
improved Euler method are convergent with convergence order one and 
two, respectively, as observed in the special cases of Examples 10.9 and 
10.13. 

10.3 Single-Step Methods 

We generalize the Euler methods into more general single-step methods by 
the following definition. 

Definition 10.14 Single-step methods for the approximate solution of the
initial value problem

$$u' = f(x,u), \quad u(x_0) = u_0,$$

construct approximations $u_j$ to the exact solution $u(x_j)$ at the equidistant
grid points

$$x_j := x_0 + jh, \quad j = 1,2,\dots,$$

with step size $h$ by

$$u_{j+1} := u_j + h\varphi(x_j,u_j;h), \quad j = 0,1,\dots,$$

where the function $\varphi : G \times (0,\infty) \to \mathbb{R}$ is given in terms of the right-hand
side $f : G \to \mathbb{R}$ of the differential equation.

Example 10.15 The Euler method and the improved Euler method are
single-step methods with

$$\varphi(x,u;h) = f(x,u) \tag{10.10}$$

and

$$\varphi(x,u;h) = \frac{1}{2}\,[f(x,u) + f(x + h, u + h f(x,u))], \tag{10.11}$$

respectively. □
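
Definition 10.14 suggests a generic driver in which only the function $\varphi$ changes from method to method. The following Python sketch illustrates this under assumptions: the names `single_step`, `phi_euler`, and `phi_improved_euler` are hypothetical, and $\varphi$ is passed as a callable `phi(x, u, h)`.

```python
def single_step(phi, x0, u0, h, n):
    """Return [u_0, ..., u_n] with u_{j+1} = u_j + h*phi(x_j, u_j, h)."""
    us = [u0]
    for j in range(n):
        us.append(us[-1] + h * phi(x0 + j * h, us[-1], h))
    return us


def phi_euler(f):
    """Increment function (10.10) of the Euler method."""
    return lambda x, u, h: f(x, u)


def phi_improved_euler(f):
    """Increment function (10.11) of the improved Euler method."""
    return lambda x, u, h: 0.5 * (f(x, u) + f(x + h, u + h * f(x, u)))


# Usage: u' = u, u(0) = 1 on [0, 1] with ten steps.
growth = lambda x, u: u
euler_values = single_step(phi_euler(growth), 0.0, 1.0, 0.1, 10)
improved_values = single_step(phi_improved_euler(growth), 0.0, 1.0, 0.1, 10)
```

For $u' = u$ each Euler step multiplies by $1 + h$ and each improved Euler step by $1 + h + h^2/2$, which gives a simple check of the driver.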

The function $\varphi$ describes how the differential equation

$$u' = f(x,u)$$

is approximated by the difference equation

$$\frac{1}{h}\,[u(x + h) - u(x)] = \varphi(x,u;h).$$

From a reasonable approximation we expect that the exact solution to the
initial value problem approximately satisfies the difference equation. Hence,

$$\frac{1}{h}\,[u(x + h) - u(x)] - \varphi(x,u;h) \to 0, \quad h \to 0,$$

must be fulfilled for the exact solution $u$. We also expect that the order of
this convergence will influence the accuracy of the approximate solution.
These considerations are made more precise by the following definition.

Definition 10.16 For each $(x,u) \in G$ denote by $\eta = \eta(\xi)$ the unique
solution to the initial value problem

$$\eta' = f(\xi,\eta), \quad \eta(x) = u,$$

with initial data $(x,u)$. Then

$$\Delta(x,u;h) := \frac{1}{h}\,[\eta(x + h) - \eta(x)] - \varphi(x,u;h)$$

is called the local discretization error. The single-step method is called
consistent (with the initial value problem) if

$$\lim_{h \to 0} \Delta(x,u;h) = 0$$

uniformly for all $(x,u) \in G$, and it is said to have consistency order $p$ if

$$|\Delta(x,u;h)| \le K h^p$$

for all $(x,u) \in G$, all $h > 0$, and some constant $K$.

Without loss of generality, in the sequel we always will assume that $f$
(and later also derivatives of $f$) are uniformly continuous and bounded on
$G$. This can always be achieved by reducing $G$ to a smaller domain.

Theorem 10.17 A single-step method is consistent if and only if

$$\lim_{h \to 0} \varphi(x,u;h) = f(x,u)$$

uniformly for all $(x,u) \in G$.

Proof. Since we assume $f$ to be bounded, we have

$$\eta(x + t) - \eta(x) = \int_0^t \eta'(x + s)\,ds = \int_0^t f(x + s, \eta(x + s))\,ds \to 0, \quad t \to 0,$$

uniformly for all $(x,u) \in G$. Therefore, since we also assume that $f$ is
uniformly continuous, it follows that

$$\left| \frac{1}{h} \int_0^h [\eta'(x + t) - \eta'(x)]\,dt \right| \le \max_{0 \le t \le h} |\eta'(x + t) - \eta'(x)|
= \max_{0 \le t \le h} |f(x + t, \eta(x + t)) - f(x, \eta(x))| \to 0, \quad h \to 0,$$

uniformly for all $(x,u) \in G$. From this we obtain that

$$\Delta(x,u;h) + \varphi(x,u;h) - f(x,u) = \frac{1}{h}\,[\eta(x + h) - \eta(x)] - \eta'(x)
= \frac{1}{h} \int_0^h [\eta'(x + t) - \eta'(x)]\,dt \to 0, \quad h \to 0,$$

uniformly for all $(x,u) \in G$. This now implies that the two conditions
$\Delta \to 0$, $h \to 0$, and $\varphi \to f$, $h \to 0$, are equivalent. □

Theorem 10.18 The Euler method is consistent. If $f$ is continuously
differentiable in $G$, then the Euler method has consistency order one.

Proof. Consistency is a consequence of Theorem 10.17 and the fact that
$\varphi(x,u;h) = f(x,u)$ for Euler’s method. If $f$ is continuously differentiable,
then from the differential equation $\eta' = f(\xi,\eta)$ it follows that $\eta$ is twice
continuously differentiable with

$$\eta'' = f_x(\xi,\eta) + f_u(\xi,\eta) f(\xi,\eta). \tag{10.12}$$

Therefore, Taylor’s formula yields

$$|\Delta(x,u;h)| = \left| \frac{1}{h}\,[\eta(x + h) - \eta(x)] - \eta'(x) \right| = \frac{h}{2}\,|\eta''(x + \theta h)| \le K h$$



for some $0 < \theta < 1$ and some bound $K$ for the function $\frac{1}{2}(f_x + f_u f)$. □

Theorem 10.19 The improved Euler method is consistent. If $f$ is twice
continuously differentiable in $G$, then the improved Euler method has
consistency order two.

Proof. Consistency follows from Theorem 10.17 and

$$\varphi(x,u;h) = \frac{1}{2}\,[f(x,u) + f(x + h, u + h f(x,u))] \to f(x,u), \quad h \to 0.$$

If $f$ is twice continuously differentiable, then (10.12) implies that $\eta$ is three
times continuously differentiable with

$$\eta''' = f_{xx}(\xi,\eta) + 2 f_{xu}(\xi,\eta) f(\xi,\eta) + f_{uu}(\xi,\eta) f^2(\xi,\eta)
+ f_u(\xi,\eta) f_x(\xi,\eta) + f_u^2(\xi,\eta) f(\xi,\eta).$$

Hence Taylor’s formula yields

$$\left| \eta(x + h) - \eta(x) - h\eta'(x) - \frac{h^2}{2}\,\eta''(x) \right| = \frac{h^3}{6}\,|\eta'''(x + \theta h)| \le K_1 h^3 \tag{10.13}$$

for some $0 < \theta < 1$ and a bound $K_1$ for $\frac{1}{6}(f_{xx} + 2 f_{xu} f + f_{uu} f^2 + f_u f_x + f_u^2 f)$.
From Taylor’s formula for functions of two variables we have the estimate

$$|f(x + h, u + k) - f(x,u) - h f_x(x,u) - k f_u(x,u)| \le \frac{1}{2}\,K_2 (|h| + |k|)^2$$

with a bound $K_2$ for the second derivatives $f_{xx}$, $f_{xu}$, and $f_{uu}$. From this,
setting $k = h f(x,u)$, in view of (10.12) we obtain

$$|f(x + h, u + h f(x,u)) - f(x,u) - h\eta''(x)| \le \frac{1}{2}\,K_2 (1 + K_0)^2 h^2$$

with some bound $K_0$ for $f$, whence

$$\left| \varphi(x,u;h) - f(x,u) - \frac{h}{2}\,\eta''(x) \right| \le \frac{1}{4}\,K_2 (1 + K_0)^2 h^2 \tag{10.14}$$

follows. Now combining (10.13) and (10.14), with the aid of the triangle
inequality and using the differential equation, we can establish consistency
order two. □
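
The consistency orders of Theorems 10.18 and 10.19 can be observed numerically on a problem whose solution is known in closed form. The sketch below is an illustration under an assumption not taken from the text: it uses the test equation $f(x,u) = u$, for which the solution through $(x,u)$ is $\eta(\xi) = u e^{\xi - x}$, so that $\Delta$ from Definition 10.16 can be evaluated directly. All names are hypothetical.

```python
import math


def local_error(phi, u, h):
    """Delta(x, u; h) for the test equation f(x, u) = u.

    Here eta(x + h) - eta(x) = u*(exp(h) - 1); expm1 avoids cancellation for small h.
    """
    difference_quotient = u * math.expm1(h) / h
    return difference_quotient - phi(u, h)


phi_euler = lambda u, h: u                           # (10.10) with f(x, u) = u
phi_improved = lambda u, h: 0.5 * (u + (u + h * u))  # (10.11) with f(x, u) = u

h = 1e-3
euler_ratio = local_error(phi_euler, 1.0, h) / h            # tends to 1/2 as h -> 0
improved_ratio = local_error(phi_improved, 1.0, h) / h**2   # tends to 1/6 as h -> 0
```

That $\Delta/h$ and $\Delta/h^2$ stay bounded for small $h$ is exactly the statement of consistency order one and two, respectively, for this particular equation.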

We proceed by investigating the convergence of single-step methods as
the step size $h$ tends to zero. This is done for the solution to the initial
value problem in a fixed interval $[a,b]$ with initial data at $x_0 = a$ and the
step size $h$ and the number $n$ of steps chosen such that $x_n = b$.

Definition 10.20 Assume that in the interval $[a,b]$ at the equidistant grid
points

$$x_j := x_0 + jh, \quad j = 0,1,\dots,n,$$

with $x_0 = a$ and $x_n = b$, approximate values $u_j$ for the solution $u(x_j)$ to
the initial value problem

$$u' = f(x,u), \quad u(x_0) = u_0,$$

are obtained by a single-step method. Then

$$e_j = e_j(h) := u_j - u(x_j), \quad j = 0,1,\dots,n,$$

is called the global error, and

$$E = E(h) := \max_{j=0,\dots,n} |e_j(h)|$$

is called the maximal global error. The single-step method is called
convergent if

$$\lim_{h \to 0} E(h) = 0,$$

and it is said to have convergence order $p$ if

$$E(h) \le H h^p$$

for all $h > 0$ and some constant $H$.

The following lemma is needed for our convergence analysis. 

Lemma 10.21 Let $(\xi_j)$ be a sequence in $\mathbb{R}$ with the property

$$|\xi_{j+1}| \le (1 + A)|\xi_j| + B, \quad j = 0,1,\dots,$$

for some constants $A > 0$ and $B \ge 0$. Then the estimate

$$|\xi_j| \le |\xi_0|\,e^{jA} + \frac{B}{A}\,(e^{jA} - 1), \quad j = 0,1,\dots,$$

holds.

Proof. We prove this by induction. The estimate is true for $j = 0$. Assume
that it has been proven for some $j \ge 0$. Then, with the aid of the inequality
$1 + A < e^A$, which follows from the power series for the exponential function,
we obtain

$$|\xi_{j+1}| \le (1 + A)|\xi_0|\,e^{jA} + (1 + A)\,\frac{B}{A}\,(e^{jA} - 1) + B
\le |\xi_0|\,e^{(j+1)A} + \frac{B}{A}\,(e^{(j+1)A} - 1);$$

i.e., the estimate also holds for $j + 1$. □

Theorem 10.22 Assume that the function $\varphi$ describing the single-step
method is continuous (also with respect to $h$) and satisfies a Lipschitz
condition; i.e.,

$$|\varphi(x,u;h) - \varphi(x,v;h)| \le M|u - v|$$

for all $(x,u), (x,v) \in G$, all (sufficiently small) $h$, and a Lipschitz constant
$M$. Then the single-step method is convergent if and only if it is consistent.

Proof. We first show that consistency implies convergence and assume that
the single-step method is consistent. For the difference of two consecutive
errors we compute

$$\begin{aligned}
e_{j+1} - e_j &= [u_{j+1} - u_j] - [u(x_{j+1}) - u(x_j)] \\
&= h\varphi(x_j,u_j;h) - [u(x_{j+1}) - u(x_j)] \\
&= h\,[\varphi(x_j,u_j;h) - \varphi(x_j,u(x_j);h) - \Delta(x_j,u(x_j);h)].
\end{aligned}$$

Hence

$$|e_{j+1} - e_j| \le h\,[M|u_j - u(x_j)| + c(h)], \tag{10.15}$$

where

$$c(h) := \max_{a \le x \le b} |\Delta(x,u(x);h)|$$

satisfies

$$c(h) \to 0, \quad h \to 0,$$

since we assume consistency. The inequality (10.15) implies that

$$|e_{j+1}| \le (1 + hM)|e_j| + h c(h), \quad j = 0,1,\dots,n-1.$$

From this, applying Lemma 10.21 for $A = hM$ and $B = h c(h)$ and using
$e_0 = 0$, we obtain the estimate

$$|e_j| \le \frac{c(h)}{M}\,\left(e^{M(x_j - x_0)} - 1\right), \quad j = 0,1,\dots,n. \tag{10.16}$$

This establishes the convergence

$$E(h) \le \frac{c(h)}{M}\,\left(e^{M(b - a)} - 1\right) \to 0, \quad h \to 0.$$

We now show that convergence implies consistency and assume that the
single-step method is convergent; i.e., for $h \to 0$ the approximations

$$u_{j+1} = u_j + h\varphi(x_j,u_j;h) \tag{10.17}$$

converge to the solution of

$$u'(x) = f(x,u), \quad u(x_0) = u_0,$$

for all initial data $(x_0,u_0) \in G$. We set

$$g(x,u) := \varphi(x,u;0)$$

and observe that by Theorem 10.17 the single-step method is also consistent
with the initial value problem

$$u'(x) = g(x,u), \quad u(x_0) = u_0. \tag{10.18}$$

Since we have already shown that consistency implies convergence, the
approximations (10.17) also converge to the solution of (10.18); i.e., the
solutions of the two initial value problems coincide. Therefore, we have
$f(x_0,u_0) = g(x_0,u_0)$, and since this holds for all $(x_0,u_0) \in G$, from the
continuity of $\varphi$ we conclude uniform convergence:

$$\varphi(x,u;h) \to f(x,u), \quad h \to 0.$$

Now consistency follows from Theorem 10.17. □

Theorem 10.23 Assume that the single-step method satisfies the assumptions
of the previous Theorem 10.22 and that it has consistency order $p$;
i.e., $|\Delta(x,u;h)| \le K h^p$. Then

$$|e_j| \le \frac{K}{M}\,\left(e^{M(x_j - x_0)} - 1\right) h^p, \quad j = 0,1,\dots,n;$$

i.e., the convergence also has order $p$.

Proof. This follows from (10.16) with the aid of $c(h) \le K h^p$. □

Corollary 10.24 The Euler method and the improved Euler method are
convergent. For continuously differentiable $f$ the Euler method has
convergence order one. For twice continuously differentiable $f$ the improved Euler
method has convergence order two.

Proof. By Theorems 10.18, 10.19, 10.22, and 10.23 it remains only to verify
the Lipschitz condition of the function $\varphi$ for the improved Euler method
given by (10.11). From the Lipschitz condition for $f$ we obtain

$$\begin{aligned}
|\varphi(x,u;h) - \varphi(x,v;h)|
&\le \frac{1}{2}\,|f(x,u) - f(x,v)| + \frac{1}{2}\,|f(x + h, u + h f(x,u)) - f(x + h, v + h f(x,v))| \\
&\le \frac{L}{2}\,|u - v| + \frac{L}{2}\,\bigl|[u + h f(x,u)] - [v + h f(x,v)]\bigr| \le L\left(1 + \frac{hL}{2}\right)|u - v|;
\end{aligned}$$

i.e., $\varphi$ also satisfies a Lipschitz condition. □

Single-step methods of higher order can be constructed as follows. For a
set of real numbers $s_\ell$, $\ell = 2,\dots,m$, $c_{\ell i}$, $i = 1,\dots,\ell - 1$, $\ell = 2,\dots,m$, and
$a_i$, $i = 1,\dots,m$, the quantities

$$\begin{aligned}
k_1 &= f(x_j,u_j), \\
k_2 &= f(x_j + s_2 h, u_j + c_{21} k_1 h), \\
k_3 &= f(x_j + s_3 h, u_j + c_{31} k_1 h + c_{32} k_2 h), \\
&\;\;\vdots \\
k_m &= f\Bigl(x_j + s_m h, u_j + h \sum_{i=1}^{m-1} c_{mi} k_i\Bigr)
\end{aligned}$$

are computed recursively, and then the approximation is obtained by

$$u_{j+1} = u_j + h \sum_{i=1}^{m} a_i k_i.$$

The Euler method is described by $m = 1$ and $a_1 = 1$ and the improved
Euler method by $m = 2$, $s_2 = 1$, $c_{21} = 1$, and $a_1 = a_2 = 1/2$. The basic
goal in the design of higher-order methods is, for a given $m$, to determine
the coefficients such that the order of consistency and hence the order of
convergence becomes as large as possible. As an example, we shall consider
the Runge-Kutta method, which is the most widely used and most successful
single-step method. It was introduced by Runge in 1895 for a single
differential equation and extended to systems of differential equations by
Kutta in 1901.
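
The recursive construction with coefficients $s_\ell$, $c_{\ell i}$, and $a_i$ can be sketched as one generic step function. The code is an illustration only: the name `explicit_step` and the zero-based list layout of the coefficients (with an unused entry `s[0] = 0` and an empty first row of `c`) are assumptions made for this example.

```python
def explicit_step(f, x, u, h, s, c, a):
    """One step u_j -> u_{j+1} of the general scheme with m = len(a) stages.

    Zero-based layout (assumption): stage l uses the node s[l] and the
    coefficients c[l][0], ..., c[l][l-1]; s[0] is 0 and c[0] is empty.
    """
    k = []
    for l in range(len(a)):
        increment = sum(c[l][i] * k[i] for i in range(l))
        k.append(f(x + s[l] * h, u + h * increment))
    return u + h * sum(a_i * k_i for a_i, k_i in zip(a, k))


# Coefficients of the Euler method (m = 1) and the improved Euler method (m = 2).
EULER = dict(s=[0.0], c=[[]], a=[1.0])
IMPROVED_EULER = dict(s=[0.0, 1.0], c=[[], [1.0]], a=[0.5, 0.5])

f = lambda x, u: x**2 + u**2
first_euler_value = explicit_step(f, 0.0, 0.0, 0.1, **EULER)
first_improved_value = explicit_step(f, 0.0, 0.0, 0.1, **IMPROVED_EULER)
```

With this layout a new method is specified by data alone, which makes it easy to compare coefficient sets for a given number $m$ of stages.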

Definition 10.25 The Runge-Kutta method for the numerical solution of
the initial value problem (10.7) constructs approximations $u_j$ to the exact
solution $u(x_j)$ at the equidistant grid points

$$x_j := x_0 + jh, \quad j = 1,2,\dots,$$

with step size $h$ by using the above higher-order method with

$$\begin{aligned}
k_1 &= f(x_j,u_j), \\
k_2 &= f\left(x_j + \frac{h}{2}, u_j + \frac{h}{2}\,k_1\right), \\
k_3 &= f\left(x_j + \frac{h}{2}, u_j + \frac{h}{2}\,k_2\right), \\
k_4 &= f(x_j + h, u_j + h k_3),
\end{aligned}$$

and

$$u_{j+1} = u_j + \frac{h}{6}\,(k_1 + 2k_2 + 2k_3 + k_4).$$

For the differential equation $u' = f(x)$ the Runge-Kutta method coincides
with Simpson’s rule for numerical integration.
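
Definition 10.25 translates directly into code. The following Python sketch is an illustration with hypothetical names; the test equation $u' = u$, $u(0) = 1$, and the integrand $\cos x$ are assumptions chosen only to make the fourth-order behavior and the connection with Simpson’s rule visible.

```python
import math


def runge_kutta_step(f, x, u, h):
    """One step of the Runge-Kutta method of Definition 10.25."""
    k1 = f(x, u)
    k2 = f(x + h / 2, u + h / 2 * k1)
    k3 = f(x + h / 2, u + h / 2 * k2)
    k4 = f(x + h, u + h * k3)
    return u + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)


def runge_kutta(f, x0, u0, h, n):
    """Return [u_0, ..., u_n] on the grid x_j = x_0 + j*h."""
    us = [u0]
    for j in range(n):
        us.append(runge_kutta_step(f, x0 + j * h, us[-1], h))
    return us


def error_at_one(n):
    """Global error at x = 1 for u' = u, u(0) = 1, with n steps."""
    return abs(runge_kutta(lambda x, u: u, 0.0, 1.0, 1.0 / n, n)[-1] - math.e)


# For u' = f(x) a single step is Simpson's rule on [x, x + h].
simpson = (math.cos(0.0) + 4 * math.cos(0.5) + math.cos(1.0)) / 6
quadrature = runge_kutta_step(lambda x, u: math.cos(x), 0.0, 0.0, 1.0)
```

Halving the step size should reduce `error_at_one` by a factor of about sixteen, as expected for convergence order four.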

Theorem 10.26 The Runge-Kutta method is consistent. If $f$ is four-times
continuously differentiable, then it has consistency order four and hence
convergence order four.

Proof. The function $\varphi$ describing the Runge-Kutta method is given
recursively by

$$\varphi = \frac{1}{6}\,(\varphi_1 + 2\varphi_2 + 2\varphi_3 + \varphi_4),$$

where

$$\begin{aligned}
\varphi_1(x,u;h) &= f(x,u), \\
\varphi_2(x,u;h) &= f\left(x + \frac{h}{2}, u + \frac{h}{2}\,\varphi_1(x,u;h)\right), \\
\varphi_3(x,u;h) &= f\left(x + \frac{h}{2}, u + \frac{h}{2}\,\varphi_2(x,u;h)\right), \\
\varphi_4(x,u;h) &= f(x + h, u + h\varphi_3(x,u;h)).
\end{aligned}$$

From this, consistency follows immediately by Theorem 10.17.

Analogously to the proof of Theorem 10.19 for the improved Euler method,
the consistency order four can be established by a Taylor expansion of
$\varphi(x,u;h)$ with respect to powers of $h$ up to order $h^4$ and expressing the
derivatives of $\eta$ on the right-hand side of

$$\frac{1}{h}\,[\eta(x + h) - \eta(x)] = \eta'(x) + \frac{h}{2}\,\eta''(x) + \frac{h^2}{6}\,\eta'''(x) + \frac{h^3}{24}\,\eta''''(x) + O(h^4)$$

through $f$ and its derivatives by using the differential equation. We leave
the details as an exercise for the reader (see Problem 10.9). □



The error estimate in Theorem 10.23 is not practical in general, since the
constants $M$ and $K$ have to be determined from higher-order derivatives of
$f$. Therefore, in practice, the error is estimated by the following heuristic
consideration. For convergence order $p$, the error between the approximate
solution $u(x;h)$ at the point $x$, obtained with step size $h$, and the exact
solution $u(x)$ satisfies

$$u(x;h) - u(x) \approx c h^p$$

for some constant $c$. Correspondingly, for step size $h/2$ we have that

$$u\left(x;\frac{h}{2}\right) - u(x) \approx c \left(\frac{h}{2}\right)^p.$$

Subtracting these relations yields

$$u(x;h) - u\left(x;\frac{h}{2}\right) \approx c \left(\frac{h}{2}\right)^p (2^p - 1).$$

Now the constant $c$ can be eliminated from the last two relations, with the
result that

$$u\left(x;\frac{h}{2}\right) - u(x) \approx \frac{1}{2^p - 1} \left[u(x;h) - u\left(x;\frac{h}{2}\right)\right]. \tag{10.19}$$

Hence we may consider (10.19) as an estimate for the error occurring with
the smaller step size $h/2$. However, we need to keep in mind that (10.19)
does not provide an exact bound and might fail in particular situations.
Nevertheless, it can be used for controlling the step size during the course of
the numerical calculations in order to adjust the actual step size according
to the required accuracy.

Solving for $u(x)$ in (10.19) yields

$$u(x) \approx \frac{2^p\,u\left(x;\frac{h}{2}\right) - u(x;h)}{2^p - 1}. \tag{10.20}$$

We leave it as an exercise for the reader to interpret (10.20) as a Richardson
extrapolation, which we explained in detail for the case of numerical
integration in Section 9.5.
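
The heuristic estimate (10.19) and the extrapolated value (10.20) can be tried out with any method of known order. The sketch below is an illustration under assumptions: it uses the Euler method ($p = 1$) on the test problem $u' = u$, $u(0) = 1$, whose exact value $u(1) = e$ is known, and all function names are hypothetical.

```python
import math


def euler_value(f, x0, u0, x_end, n):
    """Euler approximation to u(x_end) using n steps of size (x_end - x0)/n."""
    h = (x_end - x0) / n
    u = u0
    for j in range(n):
        u += h * f(x0 + j * h, u)
    return u


def halving_estimate(coarse, fine, p):
    """Right-hand side of (10.19): estimate of u(x; h/2) - u(x)."""
    return (coarse - fine) / (2**p - 1)


def extrapolate(coarse, fine, p):
    """Right-hand side of (10.20): improved approximation to u(x)."""
    return (2**p * fine - coarse) / (2**p - 1)


f = lambda x, u: u
coarse = euler_value(f, 0.0, 1.0, 1.0, 10)   # u(1; h) with h = 0.1
fine = euler_value(f, 0.0, 1.0, 1.0, 20)     # u(1; h/2)
estimated = halving_estimate(coarse, fine, p=1)
actual = fine - math.e
improved = extrapolate(coarse, fine, p=1)
```

In this experiment `estimated` reproduces `actual` to within a few percent, and `improved` is considerably closer to $e$ than `fine`; neither observation is guaranteed in general, as noted above.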



10.4 Multistep Methods 

In the single-step methods each computed function value of / is used only 
in one step. It is natural to try to design methods where each computed 
function value of / is used in several steps. This leads to multistep methods, 
as described in the following definition. 

Definition 10.27 Multistep methods for the approximate solution of the
initial value problem

$$u' = f(x,u), \quad u(x_0) = u_0,$$

construct approximations $u_j$ to the exact solution $u(x_j)$ at the equidistant
grid points

$$x_j := x_0 + jh, \quad j = 1,2,\dots,$$

with step size $h$ by

$$u_{j+r} + a_{r-1} u_{j+r-1} + \cdots + a_0 u_j = h\varphi(x_j,u_j,\dots,u_{j+r-1};h)$$

for $j = 0,1,\dots$. Here $\varphi$ is a function of $r + 2$ variables given in terms of
$f$, and $a_0,\dots,a_{r-1}$ are constants.

To start such a multistep method involving $r$ steps, $r$ starting values
$u_0, u_1, \dots, u_{r-1}$ are required. For example, these can be approximately
computed from the initial value $u_0$ by a single-step method such as the
Runge-Kutta method.

A particular class of multistep methods is obtained by approximating
the integral in

$$u(x_{j+r}) - u(x_{j+r-k}) = \int_{x_{j+r-k}}^{x_{j+r}} f(\xi,u(\xi))\,d\xi$$

with $1 \le k \le r$ by an interpolatory quadrature, i.e., by

$$\int_{x_{j+r-k}}^{x_{j+r}} p(\xi)\,d\xi,$$

where $p \in P_s$ with $0 \le s \le r$ is the uniquely determined polynomial with
the interpolation property

$$p(x_{j+m}) = f(x_{j+m},u_{j+m}), \quad m = 0,\dots,s,$$

i.e., by setting

$$u_{j+r} - u_{j+r-k} = \int_{x_{j+r-k}}^{x_{j+r}} p(\xi)\,d\xi. \tag{10.21}$$

Integrating the Lagrange representation (8.2) of the interpolation polynomial
shows that these multistep methods are of the form

$$u_{j+r} - u_{j+r-k} = h \sum_{m=0}^{s} b_m f(x_{j+m},u_{j+m})$$

with coefficients $b_0,\dots,b_s$ depending on $r$, $k$, and $s$.

From (10.21) we can generate a variety of methods by choosing the number
of steps $r$, the number $s + 1$ of interpolation points, and the number
$k$ of integration intervals appropriately. We briefly report on some of these
methods.

The Adams-Bashforth method , introduced by Adams and Bashforth in 
1883, is obtained by taking k = 1 and s = r—1. For r = 1 the interpolation 
polynomial is a constant, and therefore 



Uj + 1 = Uj + hf(xj,Uj). 



( 10 . 22 ) 



For r = 2 the interpolation is linear and leads to 

u j + 2 — u j + 1 2 > w j+i) — f( x ji u j)] (10.23) 

(see Problem 10.12). The Adams-Bashforth method is explicit. Clearly. 
(10.22) coincides with the Euler method from Definition 10.8. 
The Adams-Moulton method, devised by Moulton during World War I,
is given by $k = 1$ and $s = r$. For $r = 1$ the interpolation is linear, whence

$$u_{j+1} = u_j + \frac{h}{2}\,[f(x_{j+1}, u_{j+1}) + f(x_j, u_j)]. \qquad (10.24)$$

For $r = 2$ the interpolation is quadratic, leading to

$$u_{j+2} = u_{j+1} + \frac{h}{12}\,[5 f(x_{j+2}, u_{j+2}) + 8 f(x_{j+1}, u_{j+1}) - f(x_j, u_j)] \qquad (10.25)$$

(see Problem 10.12). The Adams-Moulton method is implicit. One iteration
step for the solution of the nonlinear equation for $u_{j+r}$, starting with the
approximation given by the corresponding Adams-Bashforth method, leads
to a predictor-corrector method. Clearly, (10.24) coincides with the implicit
Euler method from Definition 10.10.

The explicit method for $k = 2$ and $s = r - 1$ is known as the Nyström
method, and the implicit method for $k = 2$ and $s = r$ is called the
Milne-Thomson method (see Problem 10.14).
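
The two-step formulas (10.23) and (10.25) can be tried out directly. The following Python sketch is only an illustration: the function names, the use of one improved Euler step to produce the starting value $u_1$, and the test problem $u' = -u$ are assumptions made here and are not prescribed by the text.

```python
import math

def _heun_start(f, x0, u0, h):
    # Starting value u_1 from one improved Euler step (an assumption made
    # for this sketch; any starter of matching order would do).
    k1 = f(x0, u0)
    k2 = f(x0 + h, u0 + h * k1)
    return u0 + 0.5 * h * (k1 + k2)

def adams_bashforth2(f, x0, u0, h, n):
    """Explicit two-step Adams-Bashforth method (10.23) on n >= 1 steps."""
    xs = [x0 + j * h for j in range(n + 1)]
    us = [u0, _heun_start(f, x0, u0, h)]
    for j in range(n - 1):
        fj = f(xs[j], us[j])
        fj1 = f(xs[j + 1], us[j + 1])
        us.append(us[j + 1] + 0.5 * h * (3.0 * fj1 - fj))
    return xs, us

def predictor_corrector2(f, x0, u0, h, n):
    """Predict with (10.23), then apply (10.25) once as corrector."""
    xs = [x0 + j * h for j in range(n + 1)]
    us = [u0, _heun_start(f, x0, u0, h)]
    for j in range(n - 1):
        fj = f(xs[j], us[j])
        fj1 = f(xs[j + 1], us[j + 1])
        pred = us[j + 1] + 0.5 * h * (3.0 * fj1 - fj)
        corr = us[j + 1] + h / 12.0 * (5.0 * f(xs[j + 2], pred) + 8.0 * fj1 - fj)
        us.append(corr)
    return xs, us

# Hypothetical test problem: u' = -u, u(0) = 1 on [0, 1].
rhs = lambda x, u: -u
for method in (adams_bashforth2, predictor_corrector2):
    _, us = method(rhs, 0.0, 1.0, 0.01, 100)
    print(method.__name__, abs(us[-1] - math.exp(-1.0)))
```

Halving the step size should reduce the Adams-Bashforth error by roughly a factor of four, in line with the consistency order two established below for $s = 1$.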

Definition 10.28 For each $(x, u) \in G$ denote by $\eta = \eta(\xi)$ the unique
solution to the initial value problem

$$\eta' = f(\xi, \eta), \quad \eta(x) = u,$$

for the initial data $(x, u)$. Then

$$\Delta(x, u; h) := \frac{1}{h}\left[\eta(x + rh) + \sum_{m=0}^{r-1} a_m\,\eta(x + mh)\right] - \varphi(x, \eta(x), \ldots, \eta(x + (r-1)h); h)$$

is called the local discretization error. The multistep method is called
consistent (with the initial value problem) if

$$\lim_{h \to 0} \Delta(x, u; h) = 0$$

uniformly for all $(x, u) \in G$, and it is said to have consistency order $p$ if

$$|\Delta(x, u; h)| \le K h^p$$

for all $(x, u) \in G$, all $h > 0$, and some constant $K$.

Theorem 10.29 If $f$ is $(s + 1)$-times continuously differentiable, then the
multistep methods (10.21) are consistent of order $s + 1$.

Proof. By construction we have that

$$\Delta(x, u; h) = \frac{1}{h} \int_{x+(r-k)h}^{x+rh} [f(\xi, \eta(\xi)) - p(\xi)]\,d\xi,$$

where $p$ denotes the polynomial satisfying the interpolation condition

$$p(x + mh) = f(x + mh, \eta(x + mh)), \quad m = 0, \ldots, s.$$

By Theorem 8.10 on the remainder in polynomial interpolation, we can
estimate

$$|f(\xi, \eta(\xi)) - p(\xi)| \le K h^{s+1}$$

for all $\xi$ in the interval $x + (r - k)h \le \xi \le x + rh$ and some constant $K$
depending on $f$ and its derivatives up to order $s + 1$. □



Analyzing the convergence for multistep methods is more involved than 
for single-step methods for the following two reasons. Firstly, the approx- 
imation obtained by a multistep method is, of course, also influenced by 
the errors 

$$e_j := u_j - u(x_j), \quad j = 0, \ldots, r - 1,$$

in the starting values. Hence we give the following definition.

Definition 10.30 The starting values $u_j$, $j = 0, \ldots, r - 1$, are called
consistent if

$$\lim_{h \to 0}\,[u_j(h) - u(x_j)] = 0, \quad j = 0, \ldots, r - 1.$$

They are said to have consistency order $p$ if

$$|u_j(h) - u(x_j)| \le K^* h^p, \quad j = 0, \ldots, r - 1,$$

for all $h > 0$ and some constant $K^*$.



To make sure that the consistency order of the starting values coincides 
with the consistency order of the multistep method, the single-step method 
for computing the starting values has to be chosen accordingly. 

Secondly, multistep methods can be unstable, as illustrated by the fol- 
lowing example. 

Example 10.31 Let $p$ be the quadratic interpolation polynomial satisfying

$$p(x_j) = u(x_j), \quad j = 0, 1, 2,$$

and approximate

$$u'(x_0) \approx p'(x_0).$$

Using the fact that the approximation for the derivative is exact for
polynomials of degree less than or equal to two, simple calculations show that
(see Problem 10.15)

$$p'(x_0) = \frac{1}{2h}\,[-u(x_2) + 4u(x_1) - 3u(x_0)]. \qquad (10.26)$$
If $u$ is three times continuously differentiable, by Theorem 8.10 we have

$$\left|\frac{u(x) - p(x)}{x - x_0}\right| \le \frac{1}{6}\,\|u'''\|_\infty\,|(x - x_1)(x - x_2)|,$$

and from this, passing to the limit $x \to x_0$, it follows that the error for the
derivative can be estimated by

$$|u'(x_0) - p'(x_0)| \le \frac{h^2}{3}\,\|u'''\|_\infty. \qquad (10.27)$$

By approximating

$$p'(x_0) \approx u'(x_0) = f(x_0, u_0)$$

we derive a multistep method of the form

$$u_{j+2} - 4u_{j+1} + 3u_j = -2h f(x_j, u_j), \quad j = 0, 1, \ldots. \qquad (10.28)$$

From (10.27) it follows that (10.28) is consistent with order two if $f$ is twice
continuously differentiable.

Now we consider the initial value problem

$$u' = -u, \quad u(0) = 1,$$

with the solution $u(x) = e^{-x}$. Here the multistep method (10.28) reads

$$u_{j+2} - 4u_{j+1} + (3 - 2h)u_j = 0, \quad j = 0, 1, \ldots. \qquad (10.29)$$

Table 10.3 gives the error $e_j = u_j - e^{-x_j}$ between the approximate and exact
solutions for the step sizes $h = 0.1$ and $h = 0.01$. For the starting values,
$u_0 = 1$ and $u_1 = e^{-h}$ have been used with ten-decimal-digit accuracy. The
last column gives the quotient $q_j := e_j / e_{j-1}$ of the errors in two consecutive
steps.



TABLE 10.3. Numerical results for Example 10.31

             h = 0.1                              h = 0.01
   j    x_j       e_j      q_j          j    x_j        e_j      q_j
   4    0.4     0.0109    3.60          5    0.05     0.0000    3.23
   6    0.6     0.1123    3.15         10    0.10     0.0099    3.01
   8    0.8     1.0858    3.10         15    0.15     2.4456    3.01
  10    1.0    10.4143    3.10         20    0.20   604.1985    3.01


In order to explain the numerical failure indicated by the results in Table
10.3, we solve the difference equation (10.29) by looking for solutions of the
form

$$u_j = a \lambda^j, \qquad (10.30)$$

where $a$ and $\lambda$ are complex numbers. Substituting into (10.29) shows that
(10.30) solves (10.29) if and only if $\lambda$ is a solution of the so-called
characteristic equation

$$\lambda^2 - 4\lambda + (3 - 2h) = 0.$$

This quadratic equation has two solutions, namely

$$\lambda_{1,2} = 2 \mp \sqrt{1 + 2h}.$$

Therefore, the general solution of (10.29) is given by

$$u_j = a \lambda_1^j + b \lambda_2^j.$$

The two constants $a$ and $b$ are determined by the conditions $u_0 = 1$ and
$u_1 = e^{-h}$ and have the values

$$a = \frac{\lambda_2 - e^{-h}}{\lambda_2 - \lambda_1} = 1 + O(h^2)$$

and

$$b = \frac{e^{-h} - \lambda_1}{\lambda_2 - \lambda_1} = O(h^2).$$

The term $a \lambda_1^j$ in the solution to the difference equation approximates the
solution $e^{-x_j} = e^{-jh}$ to the initial value problem, since

$$a \lambda_1^j = [1 + O(h^2)]\,[1 - h + O(h^2)]^j \approx e^{-jh}.$$



However, the additional term $b \lambda_2^j$ grows exponentially, and the relation

$$\frac{u_j - u(x_j)}{u_{j-1} - u(x_{j-1})} \approx \lambda_2 = 3 + h + O(h^2)$$

explains the last column of Table 10.3. □
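
The instability of Example 10.31 is easy to reproduce. The following Python sketch iterates the recurrence (10.29) for $h = 0.1$ and prints the errors and their quotients next to $\lambda_2 = 2 + \sqrt{1 + 2h}$. The function name is chosen here for illustration, and the starting value $e^{-h}$ is taken in double precision rather than with ten decimal digits, so the last digits may differ slightly from Table 10.3.

```python
import math

def unstable_two_step_errors(h, n):
    """Iterate (10.29) for u' = -u, u(0) = 1; return e_j = u_j - exp(-j*h)."""
    u = [1.0, math.exp(-h)]          # starting values u_0 and u_1
    for j in range(n - 1):
        # u_{j+2} = 4 u_{j+1} - (3 - 2h) u_j
        u.append(4.0 * u[j + 1] - (3.0 - 2.0 * h) * u[j])
    return [u[j] - math.exp(-j * h) for j in range(n + 1)]

h = 0.1
errors = unstable_two_step_errors(h, 10)
lam2 = 2.0 + math.sqrt(1.0 + 2.0 * h)
for j in (4, 6, 8, 10):
    print(j, round(j * h, 1), round(errors[j], 4),
          round(errors[j] / errors[j - 1], 2))
print("lambda_2 =", round(lam2, 4))
```

The quotients settle near $\lambda_2 \approx 3.1$, which is the growth factor of the parasitic solution $b\lambda_2^j$.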



Roughly speaking, for multistep methods with $r \ge 2$, the (homogeneous)
difference equation of order $r$ occurring in the multistep method has $r$
linearly independent solutions, whereas the approximated differential
equation has only one solution. Hence only one of the solutions to the difference
equation corresponds to the differential equation. Therefore, convergence
of the multistep method can be expected only when the additional
solutions to the difference equation remain bounded. Note that these additional
solutions will always be activated by errors in the starting values and by
round-off errors. For this reason we proceed by investigating the stability
of the difference equation.

Definition 10.32 The linear difference equation

$$u_{j+r} + \sum_{m=0}^{r-1} a_m u_{j+m} = 0, \quad j = 0, 1, \ldots, \qquad (10.31)$$

with constant coefficients $a_0, \ldots, a_{r-1}$ is called stable if all its solutions are
bounded.
Theorem 10.33 The linear difference equation (10.31) is stable if and
only if it satisfies the root condition, i.e., if all the zeros $\lambda$ of the
characteristic polynomial

$$p(\lambda) := \lambda^r + \sum_{m=0}^{r-1} a_m \lambda^m \qquad (10.32)$$

have absolute value $|\lambda| \le 1$, and zeros satisfying $|\lambda| = 1$ are simple zeros.

Proof. We begin by noting that each solution to the difference equation
(10.31) is uniquely determined by its $r$ initial values $u_0, u_1, \ldots, u_{r-1}$.
Obviously, from these initial values the remaining terms $u_r, u_{r+1}, \ldots$ are
recursively determined by (10.31).

For convenience we set $a_r = 1$ and denote by $\Lambda$ the differential operator
given by $(\Lambda f)(\lambda) = \lambda f'(\lambda)$. Then for the sequence

$$u_j = j^n \lambda^j, \quad j = 0, 1, \ldots, \qquad (10.33)$$

we have that

$$u_{j+r} + \sum_{m=0}^{r-1} a_m u_{j+m} = \sum_{m=0}^{r} a_m (j + m)^n \lambda^{j+m}$$

$$= \lambda^j \sum_{k=0}^{n} \binom{n}{k} j^{n-k} \sum_{m=0}^{r} a_m m^k \lambda^m$$

$$= \lambda^j \sum_{k=0}^{n} \binom{n}{k} j^{n-k}\,(\Lambda^k p)(\lambda).$$

From this it can be deduced that if $\lambda$ is a zero of the characteristic
polynomial $p$ of multiplicity $s$, then for $n = 0, 1, \ldots, s - 1$ the sequence (10.33)
solves the difference equation.

Now assume that $\lambda_1, \ldots, \lambda_k$ are the zeros of the characteristic polynomial
(10.32) and have multiplicities $s_1, \ldots, s_k$, i.e.,

$$p(\lambda) = \prod_{l=1}^{k} (\lambda - \lambda_l)^{s_l}.$$

Then the general solution of the homogeneous difference equation (10.31)
is given by

$$u_j = \sum_{l=1}^{k} \sum_{s=0}^{s_l - 1} \alpha_{ls}\, j^s \lambda_l^j \qquad (10.34)$$

with $r$ arbitrary constants $\alpha_{ls}$. To establish this we need to show that the
coefficients can be chosen such that arbitrarily given initial conditions

$$\sum_{l=1}^{k} \sum_{s=0}^{s_l - 1} \alpha_{ls}\, j^s \lambda_l^j = u_j, \quad j = 0, \ldots, r - 1, \qquad (10.35)$$

are fulfilled. The homogeneous adjoint system to the system (10.35) reads

$$\sum_{j=0}^{r-1} \beta_j\, j^s \lambda_l^j = 0, \quad s = 0, \ldots, s_l - 1, \quad l = 1, \ldots, k.$$

Assume that $\beta_j$, $j = 0, \ldots, r - 1$, is a solution. Then the polynomial

$$q(\lambda) := \sum_{j=0}^{r-1} \beta_j \lambda^j$$

of degree $r - 1$ has the zeros $\lambda_l$ with multiplicity $s_l$ for $l = 1, \ldots, k$; i.e.,
the polynomial has $r$ zeros and therefore, by Theorem 8.1, must vanish
identically. This implies $\beta_0 = \cdots = \beta_{r-1} = 0$. Hence, for each given
right-hand side the system (10.35) has a unique solution.

Now from the form (10.34) of the general solution to the difference
equation, the equivalence of stability and the root condition is obvious. □
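
For two-step equations the statement of Theorem 10.33 can be checked by hand or by a few lines of code. The following Python sketch is restricted to $r = 2$ so that the zeros follow from the quadratic formula; the function names and the tolerance are assumptions made for this illustration. It tests the root condition and, independently, tracks the size of a solution of the recurrence.

```python
import cmath

def root_condition_two_step(a0, a1, tol=1e-12):
    """Root condition for p(lambda) = lambda^2 + a1*lambda + a0."""
    disc = cmath.sqrt(a1 * a1 - 4.0 * a0)
    lam1 = (-a1 + disc) / 2.0
    lam2 = (-a1 - disc) / 2.0
    if max(abs(lam1), abs(lam2)) > 1.0 + tol:
        return False                      # a zero outside the unit disk
    double_root = abs(lam1 - lam2) <= tol
    on_circle = abs(abs(lam1) - 1.0) <= tol
    return not (double_root and on_circle)

def max_abs_solution(a0, a1, u0, u1, n):
    """Largest |u_j| for j <= n, where u_{j+2} + a1*u_{j+1} + a0*u_j = 0."""
    prev, cur = u0, u1
    largest = max(abs(u0), abs(u1))
    for _ in range(n - 1):
        prev, cur = cur, -a1 * cur - a0 * prev
        largest = max(largest, abs(cur))
    return largest

# (10.28): lambda^2 - 4 lambda + 3 has the zeros 1 and 3.
print(root_condition_two_step(3.0, -4.0))
# Adams methods with r = 2: lambda^2 - lambda has the zeros 0 and 1.
print(root_condition_two_step(0.0, -1.0))
# Double zero at 1: the solution u_j = j is unbounded.
print(root_condition_two_step(1.0, -2.0), max_abs_solution(1.0, -2.0, 0.0, 1.0, 50))
```

The last case shows why zeros on the unit circle must be simple: a double zero at $\lambda = 1$ admits the unbounded solution $u_j = j$.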

Besides the solution (10.34) of the homogeneous difference equation, we 
also will need an explicit expression for the solution to the inhomogeneous 
difference equation. 

Lemma 10.34 For $k = 0, 1, \ldots, r - 1$, let $u_{j,k}$ denote the unique solutions
to the homogeneous difference equation (10.31) with initial values

$$u_{j,k} = \delta_{jk}, \quad j = 0, 1, \ldots, r - 1.$$

Then for a given right-hand side $c_r, c_{r+1}, \ldots$, the unique solution to the
inhomogeneous difference equation

$$z_{j+r} + \sum_{m=0}^{r-1} a_m z_{j+m} = c_{j+r}, \quad j = 0, 1, \ldots, \qquad (10.36)$$

with initial values $z_0, z_1, \ldots, z_{r-1}$ is given by

$$z_{j+r} = \sum_{k=0}^{r-1} z_k u_{j+r,k} + \sum_{k=0}^{j} c_{k+r}\, u_{j+r-k-1,r-1}, \quad j = 0, 1, \ldots. \qquad (10.37)$$

Proof. Setting $u_{m,r-1} = 0$ for $m = -1, -2, \ldots$, we can rewrite (10.37) in
the form

$$z_j = \sum_{k=0}^{r-1} z_k u_{j,k} + w_j, \quad j = 0, 1, \ldots,$$

where

$$w_j := \sum_{k=0}^{\infty} c_{k+r}\, u_{j-k-1,r-1}, \quad j = 0, 1, \ldots.$$

Obviously, $w_j = 0$ for $j = 0, \ldots, r - 1$, and therefore it remains to show
that $w_j$ satisfies the inhomogeneous difference equation (10.36).

As in the proof of Theorem 10.33 we set $a_r = 1$. Then, using $u_{m,r-1} = 0$
for $m < r - 1$, $u_{r-1,r-1} = 1$, and the homogeneous difference equation for
$u_{m,r-1}$, we compute

$$\sum_{m=0}^{r} a_m w_{j+m} = \sum_{m=0}^{r} a_m \sum_{k=0}^{\infty} c_{k+r}\, u_{j+m-k-1,r-1}$$

$$= \sum_{m=0}^{r} a_m \sum_{k=0}^{j} c_{k+r}\, u_{j+m-k-1,r-1}$$

$$= \sum_{k=0}^{j} c_{k+r} \sum_{m=0}^{r} a_m u_{j+m-k-1,r-1} = c_{j+r}.$$

Now the proof is completed by noting that each solution to the
inhomogeneous difference equation (10.36) is uniquely determined by its $r$ initial
values $z_0, z_1, \ldots, z_{r-1}$. □
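
The representation (10.37) can be compared numerically with the direct recursion (10.36). The Python sketch below does this for an arbitrary two-step example; the function names, the coefficients, and the right-hand side are hypothetical choices for illustration only.

```python
def solve_recurrence(a, init, rhs, n):
    """Direct recursion for (10.36); rhs[j] holds c_{j+r}, a = [a_0..a_{r-1}]."""
    r = len(a)
    z = list(init)
    for j in range(n):
        z.append(rhs[j] - sum(a[m] * z[j + m] for m in range(r)))
    return z

def solve_by_formula(a, init, rhs, n):
    """Evaluate (10.37) using the fundamental solutions u_{j,k}."""
    r = len(a)
    zero = [0.0] * n
    # fund[k][j] = u_{j,k}: homogeneous solution with u_{j,k} = delta_{jk}, j < r
    fund = [solve_recurrence(a, [1.0 if j == k else 0.0 for j in range(r)], zero, n)
            for k in range(r)]
    z = list(init)
    for j in range(n):
        hom = sum(init[k] * fund[k][j + r] for k in range(r))
        inh = sum(rhs[k] * fund[r - 1][j + r - k - 1] for k in range(j + 1))
        z.append(hom + inh)
    return z

# Hypothetical data: p(lambda) = lambda^2 - 1.5 lambda + 0.5, c_{j+2} = 0.1 (j + 1).
a = [0.5, -1.5]
init = [1.0, 2.0]
rhs = [0.1 * (j + 1) for j in range(20)]
direct = solve_recurrence(a, init, rhs, 20)
formula = solve_by_formula(a, init, rhs, 20)
print(max(abs(x - y) for x, y in zip(direct, formula)))
```

Up to round-off the two sequences agree, as the lemma asserts.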



Definition 10.35 The multistep method of Definition 10.27 is called
stable if the associated difference equation

$$u_{j+r} + \sum_{m=0}^{r-1} a_m u_{j+m} = 0$$

is stable.

Single-step methods are always stable, since the associated difference
equation $u_{j+1} - u_j = 0$ clearly satisfies the root condition.

Remark 10.36 The multistep methods (10.21) are stable.

Proof. The corresponding characteristic polynomial $p(\lambda) := \lambda^r - \lambda^{r-k}$
fulfills the root condition. □
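
The root condition for this polynomial can be read off from a factorization. The following worked step is added here for illustration and is not part of the original proof:

$$p(\lambda) = \lambda^r - \lambda^{r-k} = \lambda^{r-k}\,(\lambda^k - 1).$$

Hence the zeros are $\lambda = 0$ (of multiplicity $r - k$ if $k < r$) and the $k$ roots of unity $\lambda = e^{2\pi i m/k}$, $m = 0, \ldots, k - 1$, which are simple and satisfy $|\lambda| = 1$. For the Adams methods ($k = 1$) the only zero on the unit circle is $\lambda = 1$; for the Nyström and Milne-Thomson methods ($k = 2$) the zeros on the unit circle are $\lambda = \pm 1$.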



For establishing convergence of multistep methods, we will need the
following extension of Lemma 10.21.

Lemma 10.37 Let $(\xi_j)$ be a sequence in $\mathbb{R}$ with the property

$$|\xi_j| \le A \sum_{m=0}^{j-1} |\xi_m| + B, \quad j = 1, 2, \ldots,$$

for some constants $A > 0$ and $B > 0$. Then the estimate

$$|\xi_j| \le (A|\xi_0| + B)\,e^{(j-1)A}, \quad j = 1, 2, \ldots,$$

holds.
Proof. We prove by induction that

$$|\xi_j| \le (A|\xi_0| + B)(1 + A)^{j-1}, \quad j = 1, 2, \ldots. \qquad (10.38)$$

Then the assertion follows by using the estimate $1 + A \le e^A$. The inequality
(10.38) is true for $j = 1$. Assume that it has been proven up to some $j \ge 1$.
Then, summing the geometric series, we have

$$|\xi_{j+1}| \le A \sum_{m=0}^{j} |\xi_m| + B \le (A|\xi_0| + B) + A \sum_{m=1}^{j} (A|\xi_0| + B)(1 + A)^{m-1} = (A|\xi_0| + B)(1 + A)^j,$$

i.e., the estimate is also true for $j + 1$. □

Theorem 10.38 Assume that the function $\varphi$ describing the multistep method
is continuous and satisfies a Lipschitz condition; i.e.,

$$|\varphi(x, u_0, u_1, \ldots, u_{r-1}; h) - \varphi(x, v_0, v_1, \ldots, v_{r-1}; h)| \le M \sum_{m=0}^{r-1} |u_m - v_m|$$

for all $(x, u_0), \ldots, (x, u_{r-1}), (x, v_0), \ldots, (x, v_{r-1}) \in G$, all (sufficiently small)
$h$, and a Lipschitz constant $M$. Furthermore, assume that the multistep
method is consistent and stable and that the starting values are consistent.
Then the multistep method is convergent. If both the multistep method and
the starting values have consistency order $p$, then the convergence also is
of order $p$.

Proof. (Compare to the proof of Theorem 10.22.) For the errors

$$e_j := u_j - u(x_j)$$

we obtain

$$e_{j+r} + \sum_{m=0}^{r-1} a_m e_{j+m} = u_{j+r} + \sum_{m=0}^{r-1} a_m u_{j+m} - u(x_{j+r}) - \sum_{m=0}^{r-1} a_m u(x_{j+m})$$

$$= h\varphi(x_j, u_j, \ldots, u_{j+r-1}; h) - h\Delta(x_j, u(x_j); h) - h\varphi(x_j, u(x_j), \ldots, u(x_{j+r-1}); h).$$

We rewrite this into the form

$$e_{j+r} + \sum_{m=0}^{r-1} a_m e_{j+m} = h c_{j+r}, \quad j = 0, 1, \ldots, \qquad (10.39)$$

where

$$c_{j+r} := \varphi(x_j, u_j, \ldots, u_{j+r-1}; h) - \Delta(x_j, u(x_j); h) - \varphi(x_j, u(x_j), \ldots, u(x_{j+r-1}); h).$$

We can estimate the right-hand side by

$$|c_{j+r}| \le M \sum_{m=0}^{r-1} |e_{j+m}| + c(h), \quad j = 0, 1, \ldots, \qquad (10.40)$$

where

$$c(h) := \max_{a \le x \le b} |\Delta(x, u(x); h)|$$

satisfies $c(h) \to 0$, $h \to 0$, since we assume consistency. By Lemma 10.34
we can express the solution of (10.39) in the form

$$e_{j+r} = \sum_{k=0}^{r-1} e_k u_{j+r,k} + h \sum_{k=0}^{j} c_{k+r}\, u_{j+r-k-1,r-1}, \quad j = 0, 1, \ldots.$$

From this, since we assume stability, we can estimate

$$|e_{j+r}| \le N \left\{ d(h) + h \sum_{k=0}^{j} |c_{k+r}| \right\}, \quad j = 0, 1, \ldots,$$

for some constant $N$ and

$$d(h) := \sum_{k=0}^{r-1} |e_k|.$$

We note that $d(h) \to 0$, $h \to 0$, since the starting values are assumed to be
consistent. Inserting (10.40) into the last inequality now yields

$$|e_{j+r}| \le N \left\{ d(h) + hM \sum_{k=0}^{j} \sum_{m=0}^{r-1} |e_{k+m}| + (j + 1)\,h\,c(h) \right\}, \quad j = 0, 1, \ldots.$$

Because of

$$\sum_{k=0}^{j} \sum_{m=0}^{r-1} |e_{k+m}| = \sum_{m=0}^{r-1} \sum_{k=m}^{m+j} |e_k| \le r \sum_{k=0}^{r+j-1} |e_k| = r \sum_{k=r}^{r+j-1} |e_k| + r\,d(h)$$

and $(j + 1)h \le x_{j+1} - x_0 \le b - a$ we obtain that $|e_r| \le C\gamma(h)$ and

$$|e_{j+r}| \le C \left\{ \gamma(h) + h \sum_{k=r}^{r+j-1} |e_k| \right\}, \quad j = 1, 2, \ldots,$$

for some constant $C$ and $\gamma(h) := d(h) + c(h)$. Now Lemma 10.37 implies
that

$$|e_{j+r}| \le C\,[\,|e_r|\,h + \gamma(h)]\,e^{(j-1)Ch} \le C(1 + Ch)\,\gamma(h)\,e^{(b-a)C}, \quad j = 1, 2, \ldots,$$

whence

$$E(h) \le C(1 + Ch)\,\gamma(h)\,e^{(b-a)C} \to 0, \quad h \to 0,$$

follows, since $\gamma(h) \to 0$, $h \to 0$. For consistency order $p$ we have that
$\gamma(h) = O(h^p)$; i.e., the convergence is also of order $p$. □

The basic advantage of multistep methods results from the fact that for
arbitrary convergence order, in each step only one new evaluation of the
function $f$ is required. In contrast, for single-step methods the number of
function evaluations required in each step is equal, in general, to the
convergence order. Therefore, at the same convergence order, multistep methods
need fewer function evaluations per step and are correspondingly faster than
single-step methods. However, it should be noted that readjusting the step size
during the computation is more involved due to the need to recompute the
corresponding starting values for the new step size.



Problems 

10.1 Find the exact solution of the initial value problem

$$u' = -u^2, \quad u(0) = 1,$$

and compare it to the approximate solutions obtained by successive
approximations according to Corollary 10.6. Compute the third iterate $u_3$ and
compare the exact error $u - u_3$ to the a posteriori error estimate from Corollary 10.6.

10.2 Consider the initial value problem $u' = u$, $u(0) = 1$, and show that the
approximate solution from the Euler method is given by $u_j = (1 + h)^j$.

10.3 Find the exact solution of the initial value problem

$$u' = \frac{2u}{x}, \quad u(1) = 1.$$

Determine an analytic expression for the approximate solution by Euler’s method
and verify the convergence order one predicted by Theorem 10.18.

10.4 Show that Euler’s method fails to approximate the solution $u(x) = (\tfrac{2}{3}x)^{3/2}$
of the initial value problem $u' = u^{1/3}$, $u(0) = 0$. Explain this failure.

10.5 Show that the differential equation $u' = ax$ with $a \in \mathbb{R}$ is solved exactly
by the improved Euler method.
10.6 Show that the single-step method

$$u_{j+1} = u_j + h f\left(x_j + \frac{h}{2},\; u_j + \frac{h}{2} f(x_j, u_j)\right)$$

has consistency order two if $f$ is twice continuously differentiable. This method
is known as the modified Euler method.

10.7 Show that the single-step method given by

$$k_1 = f(x_j, u_j),$$

$$k_2 = f\left(x_j + \frac{h}{3},\; u_j + \frac{h}{3} k_1\right),$$

$$k_3 = f\left(x_j + \frac{2h}{3},\; u_j + \frac{2h}{3} k_2\right),$$

and

$$u_{j+1} = u_j + \frac{h}{4}\,(k_1 + 3k_3)$$

is consistent and has consistency order three if $f$ is three-times continuously
differentiable. This method is known as Heun’s third-order method.

10.8 Show that the single-step method given by

$$k_1 = f(x_j, u_j),$$

$$k_2 = f\left(x_j + \frac{h}{2},\; u_j + \frac{h}{2} k_1\right),$$

$$k_3 = f(x_j + h,\; u_j - h k_1 + 2h k_2),$$

and

$$u_{j+1} = u_j + \frac{h}{6}\,(k_1 + 4k_2 + k_3)$$

is consistent and has consistency order three if $f$ is three-times continuously
differentiable. This method is known as Kutta’s third-order method.



10.9 Show that the Runge-Kutta method (see Definition 10.25) has consistency 
order four if / is four-times continuously differentiable. 

10.10 Write a computer program for the Runge-Kutta method and test it for 
various examples. 

10.11 The populations $p = p(t)$ and $q = q(t)$ of two interacting animal species
that have a predator-prey relationship are modeled by the system of the
Lotka-Volterra equations

$$\frac{dp}{dt} = \alpha p + \beta pq, \quad \frac{dq}{dt} = \gamma q + \delta pq$$

with constant coefficients $\alpha < 0$, $\beta > 0$, $\gamma > 0$, and $\delta < 0$, complemented by initial
conditions $p(0) = p_0$ and $q(0) = q_0$. (Explain the significance of the signs of the
constants for the model.) For the coefficients $\alpha = -1$, $\beta = 0.01$, $\gamma = 0.25$, and
$\delta = -0.01$, test the stability of the solutions by solving the initial value problem
numerically by the Runge-Kutta method for the four different initial conditions
$p_0 = 30 \pm 1$ and $q_0 = 80 \pm 1$. Visualize the numerical results by a phase diagram,
i.e., by the curve $\{(p(t), q(t)) : t \in [0, T]\}$ for sufficiently large $T > 0$.

10.12 Verify the coefficients in the Adams-Bashforth and Adams-Moulton meth- 
ods (10.22)-(10.25). 

10.13 Determine the coefficients of the Adams-Bashforth and Adams-Moulton 
methods for r = 3. 

10.14 The multistep methods (10.21) for $k = 2$ and $s = r - 1$ and for $k = 2$
and $s = r$ are known as the Nyström method and the Milne-Thomson method,
respectively. Determine the coefficients of the Nyström method and the
Milne-Thomson method for $r = 1$ and $r = 2$.

10.15 Verify the coefficients in the difference formula (10.26). 

10.16 Construct a two-step method of the form

$$u_{j+2} + a_1 u_{j+1} + a_0 u_j = h\,[b_0 f(x_j, u_j) + b_1 f(x_{j+1}, u_{j+1})]$$

that has consistency order two and discuss its stability.

10.17 Find the general solution of the difference equation

$$u_{j+2} - 2a u_{j+1} + a u_j = 1$$

for $0 < a < 1$. Show that $\lim_{j \to \infty} u_j = 1/(1 - a)$.

10.18 Find an explicit expression for the Fibonacci numbers $a_j$, which are
defined by $a_0 = a_1 = 1$ and $a_{j+1} = a_j + a_{j-1}$ for $j \ge 1$. Is the root condition of
Theorem 10.33 satisfied?

10.19 Attempt to approximate the unique solution $u(x) = 2$ of the initial value
problem

$$u' = xu(u - 2), \quad u(0) = 2,$$

numerically by any of the methods described in this chapter. Discuss the results
by relating them to the solution of the initial value problem with perturbed initial
condition $u(0) = 2 + \alpha$ for small $\alpha \in \mathbb{R}$.

10.20 Consider the approximate solution of the initial value problem

$$u' + 100u = 100, \quad u(0) = 2,$$

by the Euler method. Explain why for an accurate approximation the step size
$h$ has to be chosen smaller than 0.02 despite the fact that the solution is
almost constant for $x$ not too small, say, for $x > 0.1$. (This differential equation
is an example of a so-called stiff equation, for which the numerical solution is
rather delicate.)
Whereas in initial value problems the solution is determined by conditions
imposed at one point only, boundary value problems for ordinary differ- 
ential equations are problems in which the solution is required to satisfy 
conditions at more than one point, usually at the two endpoints of the 
interval in which the solution is to be found. Since an ordinary differential 
equation of order n has, in principle, a general solution depending on n pa- 
rameters, the total number of boundary conditions required to determine 
a unique solution is n. For an introduction to some of the basic methods 
for the numerical solution of such boundary value problems we shall con- 
fine ourselves to the simplest boundary value problem, which is one for 
an equation of the second order in which the solution is specified at two 
distinct points. For more detailed studies we refer to [13, 36, 46]. 

As opposed to the fundamental Picard-Lindelöf existence and uniqueness
theorem for initial value problems, a detailed analysis of the existence and 
uniqueness theory for nonlinear boundary value problems is more involved 
and beyond the scope of this introduction. However, for linear boundary 
value problems the theory is more elementary, and we shall include part of 
it in our analysis. 

For the numerical solution of boundary value problems for ordinary dif- 
ferential equations three different groups of methods are available: shooting 
methods, finite difference methods, and finite element methods. Whereas 
shooting methods, which we briefly describe in Section 11.1 and which rely 
on numerical methods for initial value problems, are restricted to ordinary 
differential equations, the finite difference and finite element methods can 
also be applied to boundary value problems for partial differential equations.
Therefore, our presentation of finite difference and finite element
methods for linear ordinary differential equations is also meant as a model 
discussion for the more complicated and more important case of partial 
differential equations. 

Of course, in one chapter only a small part of the theory and the ap- 
plications of finite difference and finite element methods can be covered. 
Hence, we set ourselves the task to outline the basic ideas of these meth- 
ods by considering only the simplest cases. For a solid foundation of the 
finite element method, we felt it was necessary to include as its theoretical 
basis a discussion of the Galerkin method for strictly coercive operators, 
which appears in Section 11.3. This, in turn, made it necessary to present 
the Lax-Milgram theorem on the existence of solutions for equations with 
strictly coercive operators. 



11.1 Shooting Methods 



Consider the boundary value problem for the differential equation of the
second order

$$u'' = f(x, u, u'), \quad a \le x \le b, \qquad (11.1)$$

with boundary conditions

$$u(a) = \alpha, \quad u(b) = \beta. \qquad (11.2)$$

For the sake of simplicity we assume that the function $f$ is defined on
$[a, b] \times \mathbb{R}^2$.

Shooting methods attempt to employ the numerical methods described
in the previous chapter for initial value problems where, roughly speaking,
the initial conditions at $x = a$ are adjusted so that the solution satisfies the
required boundary conditions (11.2). For this, in addition to the boundary
value problem, we also consider the initial value problem

$$u'' = f(x, u, u'), \quad u(a) = \alpha, \quad u'(a) = s, \qquad (11.3)$$

with a real parameter $s$. Geometrically speaking, the parameter $s$ prescribes
the initial slope of the solution curve.

If we assume that $f$ is continuous and satisfies a Lipschitz condition
with respect to $u$ and $u'$, then by the Picard-Lindelöf Theorem 10.5, for
each $s \in \mathbb{R}$ there exists a unique solution $u(\cdot, s)$ of the initial value problem
(11.3). To arrive at a solution to the boundary value problem (11.1)-(11.2),
the parameter $s$ has to be chosen such that $u(b, s) = \beta$; i.e., we have to
solve the equation

$$F(s) = 0,$$

where the function $F : \mathbb{R} \to \mathbb{R}$ is defined by

$$F(s) := u(b, s) - \beta.$$
For each $s$ the value $F(s)$ can be computed approximately by one of the
numerical methods of Chapter 10 for the solution of initial value problems,
extended appropriately to the case of a second-order equation. Note that
for a nonlinear differential equation the equation $F(s) = 0$ is nonlinear.

For finding a zero of $F$ the Newton method of Section 6.2 can be
employed. For the computation of the derivative $F'(s)$, which is required for
Newton’s method, we assume that the solution $u$ to the initial value
problem (11.3) depends in a continuously differentiable manner on the
parameter $s$. This can be assured by appropriate assumptions on $f$ (see [12]). We
set

$$v := \frac{\partial u}{\partial s}$$

and differentiate the differential equation and the initial condition (11.3)
with respect to $s$ to obtain

$$v''(x, s) = f_u(x, u(x, s), u'(x, s))\,v(x, s) + f_{u'}(x, u(x, s), u'(x, s))\,v'(x, s) \qquad (11.4)$$

and

$$v(a, s) = 0, \quad v'(a, s) = 1. \qquad (11.5)$$

Since

$$F'(s) = v(b, s),$$

computing the derivative of $F$ requires solving the additional linear initial
value problem (11.4)-(11.5) for $v$, where $u$ is known from solving (11.3).
Note that from a numerical approximation, $u$ is known only at grid points.
Summarizing, we obtain the following method.



Algorithm 11.1 The shooting method with Newton iterations consists of
the following steps:

1. Choose an initial slope $s \in \mathbb{R}$.

2. Solve numerically the initial value problem for

$$u'' = f(x, u, u')$$

with initial conditions $u(a) = \alpha$, $u'(a) = s$ and the initial value problem for

$$v'' = f_u(x, u, u')\,v + f_{u'}(x, u, u')\,v'$$

with initial conditions $v(a) = 0$, $v'(a) = 1$.

3. If $u(b) = \beta$ is satisfied within the required accuracy, then stop; otherwise,
replace $s$ by

$$s - \frac{u(b) - \beta}{v(b)}$$

and go back to step 2.
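
The steps of Algorithm 11.1 can be sketched in a few lines of Python. In the sketch below the function names and default parameters are assumptions made for illustration, and the classical Runge-Kutta method is used as the initial value solver (the text leaves this choice open). The equations for $u$ and $v$ are integrated together as a first-order system in $(u, u', v, v')$. The test problem is $u'' = u^3$, $u(1) = \sqrt{2}$, $u(2) = \sqrt{2}/2$, whose exact initial slope is $-\sqrt{2}$.

```python
import math

def rk4_system(rhs, x, y, h, n):
    """n classical Runge-Kutta steps for y' = rhs(x, y), y a list of floats."""
    for _ in range(n):
        k1 = rhs(x, y)
        k2 = rhs(x + h / 2, [yi + h / 2 * k for yi, k in zip(y, k1)])
        k3 = rhs(x + h / 2, [yi + h / 2 * k for yi, k in zip(y, k2)])
        k4 = rhs(x + h, [yi + h * k for yi, k in zip(y, k3)])
        y = [yi + h / 6 * (p + 2 * q + 2 * r + t)
             for yi, p, q, r, t in zip(y, k1, k2, k3, k4)]
        x += h
    return y

def shoot_newton(f, f_u, f_up, a, b, alpha, beta,
                 s=0.0, n=100, tol=1e-10, max_iter=50):
    """Shooting with Newton iterations; returns the initial slope s."""
    h = (b - a) / n

    def rhs(x, y):
        u, up, v, vp = y
        # u'' = f(x, u, u') and v'' = f_u v + f_{u'} v', written as a system
        return [up, f(x, u, up), vp, f_u(x, u, up) * v + f_up(x, u, up) * vp]

    for _ in range(max_iter):
        u_b, _up_b, v_b, _vp_b = rk4_system(rhs, a, [alpha, s, 0.0, 1.0], h, n)
        F = u_b - beta                 # F(s) = u(b, s) - beta
        if abs(F) < tol:
            return s
        s -= F / v_b                   # Newton step with F'(s) = v(b, s)
    raise RuntimeError("Newton iteration did not converge")

slope = shoot_newton(lambda x, u, up: u ** 3,        # f
                     lambda x, u, up: 3.0 * u ** 2,  # f_u
                     lambda x, u, up: 0.0,           # f_{u'}
                     1.0, 2.0, math.sqrt(2.0), math.sqrt(2.0) / 2.0)
print(slope, -math.sqrt(2.0))
```

Starting from $s = 0$, the iteration should approach $-\sqrt{2} \approx -1.414214$; the accuracy of the limit is governed by the step size of the initial value solver.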
Example 11.2 Consider the boundary value problem

$$u'' = u^3, \quad u(1) = \sqrt{2}, \quad u(2) = \tfrac{1}{2}\sqrt{2},$$

with the exact solution $u(x) = \sqrt{2}/x$. We solve numerically the associated
initial value problem

$$u'' = u^3, \quad u(1) = \sqrt{2}, \quad u'(1) = s,$$

by the improved Euler method of Section 10.2 with step sizes $h = 0.1$,
$h = 0.01$, and $h = 0.001$. For this we transform the initial value problem
for the equation of second order into the initial value problem for the system

$$u' = w, \quad w' = u^3, \quad u(1) = \sqrt{2}, \quad w(1) = s.$$

As starting value for the Newton iteration we choose $s = 0$. The exact initial
condition is $s = -\sqrt{2} = -1.414214$. The numerical results represented in
Table 11.1 illustrate the feasibility of the shooting method with Newton
iterations. □



TABLE 11.1. Numerical results for Example 11.2.

        h = 0.1                  h = 0.01                 h = 0.001
      s        F(s)           s        F(s)           s        F(s)
   0.00000   3.61648       0.00000   3.84079       0.00000   3.84400
  -0.81116   1.10056      -0.74681   1.26284      -0.74584   1.26538
  -1.31684   0.15879      -1.28234   0.21124      -1.28180   0.21210
  -1.41553   0.00373      -1.40987   0.00678      -1.40980   0.00684
  -1.41796   0.00000      -1.41424   0.00000      -1.41420   0.00000
  -1.41796   0.00000      -1.41424   0.00000      -1.41421   0.00000


Numerical problems with ill-conditioning will arise in cases where small
changes in the initial data $s$ will cause large changes in the solution $u(\cdot, s)$.
This is illustrated by the following example.

Example 11.3 The linear boundary value problem

$$u'' - u' - 110u = 0, \quad u(0) = u(10) = 1,$$

has the unique solution

$$u(x) = \frac{1}{e^{110} - e^{-100}} \left\{ (e^{110} - 1)\,e^{-10x} + (1 - e^{-100})\,e^{11x} \right\}.$$

The unique solution to the associated initial value problem with initial
conditions $u(0) = 1$ and $u'(0) = s$ is given by

$$u(x) = \frac{11 - s}{21}\,e^{-10x} + \frac{10 + s}{21}\,e^{11x}.$$

Hence, in this case we have

$$F(s) = \frac{11 - s}{21}\,e^{-100} + \frac{10 + s}{21}\,e^{110} - 1.$$

From $F(s) = 0$ we deduce that the exact initial slope $s$ satisfies

$$-10 < s = -10 + 21\,\frac{e^{-110} - e^{-210}}{1 - e^{-210}}.$$

In a numerical computation with ten-decimal-digit accuracy the best
approximation $\tilde{s}$ to the exact zero $s$ we can expect is such that

$$-10 \le \tilde{s} \le -10 + 10^{-9}.$$

Within this interval of initial conditions we now have

$$u(10, -10) = e^{-100} \approx 0$$

and

$$u(10, -10 + 10^{-9}) = \frac{21 - 10^{-9}}{21}\,e^{-100} + \frac{10^{-9}}{21}\,e^{110} \approx 2.8 \cdot 10^{37};$$

i.e., small changes in $s$ will cause very large changes in the values of the
solution at the other endpoint. Hence, we cannot expect that this
boundary value problem can be numerically solved by the simple version of the
shooting method. □
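
The sensitivity described in Example 11.3 can be seen by evaluating the closed-form solution of the initial value problem at $x = 10$ for two slopes that differ by $10^{-9}$. The short Python sketch below does only this; the function name is chosen here for illustration.

```python
import math

def u_at_10(s):
    """u(10) for u'' - u' - 110 u = 0, u(0) = 1, u'(0) = s (closed form)."""
    return ((11.0 - s) / 21.0 * math.exp(-100.0)
            + (10.0 + s) / 21.0 * math.exp(110.0))

print(u_at_10(-10.0))           # equals e^{-100}, practically zero
print(u_at_10(-10.0 + 1e-9))    # of the order 10^37
```

A perturbation of the slope in the tenth decimal place thus changes the value at the right endpoint by about 37 orders of magnitude, so no Newton iteration on $F$ can resolve the required slope in this arithmetic.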

This difficulty can be remedied by a multiple shooting method as follows.
The interval $[a, b]$ is subdivided into $n$ subintervals according to

$$a = x_0 < x_1 < \cdots < x_{n-1} < x_n = b.$$

Then for given vectors $u = (u_0, \ldots, u_{n-1})^T$ and $s = (s_0, \ldots, s_{n-1})^T$ in $\mathbb{R}^n$
such that $u_0 = \alpha$, for $j = 0, \ldots, n - 1$ consider the $n$ initial value problems
for

$$u'' = f(x, u, u')$$

on the subintervals $[x_j, x_{j+1}]$ with initial conditions

$$u(x_j) = u_j, \quad u'(x_j) = s_j.$$

In order to obtain from this a solution to the differential equation on all
of the interval $[a, b]$, the solutions $u(\cdot, u_j, s_j)$ on the subintervals $[x_j, x_{j+1}]$
have to coincide at the grid points $x_1, \ldots, x_{n-1}$ together with their first
derivatives. Then the differential equation ensures that the function is twice
continuously differentiable on $[a, b]$. In addition, the boundary condition
$u(b) = \beta$ must be satisfied. Altogether we have the following $2n - 1$ nonlinear
equations for the $2n - 1$ unknowns $u_1, \ldots, u_{n-1}$ and $s_0, \ldots, s_{n-1}$:

$$u(x_{j+1}, u_j, s_j) - u_{j+1} = 0, \quad j = 0, \ldots, n - 2,$$

$$u'(x_{j+1}, u_j, s_j) - s_{j+1} = 0, \quad j = 0, \ldots, n - 2, \qquad (11.6)$$

$$u(x_n, u_{n-1}, s_{n-1}) - \beta = 0.$$

For the solution of this system Newton’s method can again be used. For
details we refer to [36, 50].



11.2 Finite Difference Methods 

As already indicated in Example 2.1, the basic idea of finite difference 
methods for the approximate solution of boundary value problems consists 
in replacing the derivatives in the differential equations by difference quo- 
tients. For the sake of simplicity, we confine our presentation to a linear 
boundary value problem. Without loss of generality we need consider only 
the homogeneous boundary condition, since inhomogeneous boundary con- 
ditions can be dealt with by incorporating them into the right-hand side of 
the differential equation (see Problem 11.3). 

Theorem 11.4 Assume that $q, r \in C[a, b]$ and $q \ge 0$. Then the boundary
value problem for the linear differential equation

$$-u'' + qu = r \quad \text{on } [a, b] \qquad (11.7)$$

with homogeneous boundary conditions

$$u(a) = u(b) = 0 \qquad (11.8)$$

has a unique solution $u \in C^2[a, b]$.

Proof. Assume that $u_1$ and $u_2$ are two solutions to the boundary value
problem. Then the difference $u = u_1 - u_2$ solves the homogeneous boundary
value problem

$$-u'' + qu = 0, \quad u(a) = u(b) = 0.$$

By partial integration we obtain

$$\int_a^b ([u']^2 + qu^2)\,dx = \int_a^b (-u'' + qu)\,u\,dx = 0.$$

This implies $u' = 0$ on $[a, b]$, since $q \ge 0$. Hence $u$ is constant on $[a, b]$,
and the boundary conditions finally yield $u = 0$ on $[a, b]$. Therefore, the
boundary value problem (11.7)-(11.8) has at most one solution.
The general solution of the linear differential equation (11.7) is given by

$$u = C_1 u_1 + C_2 u_2 + u^*, \qquad (11.9)$$

where $u_1, u_2$ denotes a fundamental system of two linearly independent
solutions to the homogeneous differential equation, $u^*$ is a solution to the
inhomogeneous differential equation, and $C_1$ and $C_2$ are arbitrary constants.
This can be seen with the help of the Picard-Lindelöf Theorem 10.1 (see
Problem 11.4). The boundary condition (11.8) is satisfied, provided that
the constants $C_1$ and $C_2$ solve the linear system

$$C_1 u_1(a) + C_2 u_2(a) = -u^*(a),$$

$$C_1 u_1(b) + C_2 u_2(b) = -u^*(b).$$

This system is uniquely solvable. Assume that $C_1$ and $C_2$ solve the
homogeneous system. Then $u = C_1 u_1 + C_2 u_2$ yields a solution to the homogeneous
boundary value problem. Hence $u = 0$, since we have already established
uniqueness for the boundary value problem. From this we conclude that
$C_1 = C_2 = 0$ because $u_1$ and $u_2$ are linearly independent, and the
existence proof is complete. □



For the numerical solution, proceeding as in Example 2.1, we choose an equidistant grid

$$x_j = a + jh, \quad j = 0, \ldots, n+1,$$

with the step size given by $h = (b-a)/(n+1)$ and $n \in \mathbb{N}$. At the internal grid points $x_j$, $j = 1, \ldots, n$, we replace the differential quotient in the differential equation by the difference quotient

$$u''(x_j) \approx \frac{1}{h^2}\,[u(x_{j+1}) - 2u(x_j) + u(x_{j-1})]$$

to obtain the system of equations

$$-\frac{1}{h^2}\,[u_{j-1} - (2 + h^2 q_j)u_j + u_{j+1}] = r_j, \quad j = 1, \ldots, n, \tag{11.10}$$

for approximate values $u_j$ to the exact solution $u(x_j)$. Here we have set $q_j := q(x_j)$ and $r_j := r(x_j)$. The system has to be complemented by the two boundary conditions

$$u_0 = u_{n+1} = 0. \tag{11.11}$$

For an abbreviated notation we introduce the $n \times n$ tridiagonal matrix

$$A = \frac{1}{h^2}
\begin{pmatrix}
2 + q_1 h^2 & -1 & & & & \\
-1 & 2 + q_2 h^2 & -1 & & & \\
 & -1 & 2 + q_3 h^2 & -1 & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & -1 & 2 + q_{n-1} h^2 & -1 \\
 & & & & -1 & 2 + q_n h^2
\end{pmatrix}$$

and the vectors $U = (u_1, \ldots, u_n)^T$ and $R = (r_1, \ldots, r_n)^T$. Then our system of equations, including the boundary conditions, reads

$$AU = R. \tag{11.12}$$

The following two questions have to be answered:

1. Is the system (11.12) uniquely solvable?

2. How large is the error between the approximate solution $u_j$ and the exact solution $u(x_j)$? Do we have convergence of the approximate solution to the exact solution as $h \to 0$?

Theorem 11.5 For each $h > 0$ the difference equations (11.10)-(11.11) have a unique solution.

Proof. The tridiagonal matrix $A$ is irreducible and weakly row-diagonally dominant. Hence, by Theorem 4.7, the matrix $A$ is invertible, and the Jacobi iterations converge. □

Recall that for speeding up the convergence of the Jacobi iterations we 
can use relaxation methods or multigrid methods as discussed in Sections 
4.2 and 4.3. 
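
As a small numerical illustration of Theorem 11.5, the following Python sketch runs the plain Jacobi iteration on the difference equations (11.10)-(11.11). The function names `jacobi_fd` and `residual`, the sample data $q = 1$, $r = 1$ on $[0,1]$, and the fixed number of sweeps are assumptions made for this example only; they do not come from the text.

```python
def jacobi_fd(q, r, a, b, n, sweeps):
    """Jacobi iteration for -u'' + q u = r, u(a) = u(b) = 0, with n interior points."""
    h = (b - a) / (n + 1)
    x = [a + j * h for j in range(n + 2)]      # grid x_0, ..., x_{n+1}
    u = [0.0] * (n + 2)                        # start vector; u_0 = u_{n+1} = 0 stay fixed
    for _ in range(sweeps):
        new = u[:]
        for j in range(1, n + 1):
            # Solve equation j of (11.10) for u_j, using the previous iterate.
            new[j] = (h * h * r(x[j]) + u[j - 1] + u[j + 1]) / (2.0 + h * h * q(x[j]))
        u = new
    return x, u


def residual(q, r, x, u):
    """Maximum defect of u in the difference equations (11.10)."""
    h = x[1] - x[0]
    return max(
        abs(-(u[j - 1] - (2.0 + h * h * q(x[j])) * u[j] + u[j + 1]) / (h * h) - r(x[j]))
        for j in range(1, len(x) - 1)
    )


# Hypothetical sample problem: -u'' + u = 1 on [0, 1].
x, u = jacobi_fd(lambda t: 1.0, lambda t: 1.0, 0.0, 1.0, n=9, sweeps=600)
print("max residual:", residual(lambda t: 1.0, lambda t: 1.0, x, u))
```

The large number of sweeps needed even for nine unknowns reflects the slow convergence that motivates the relaxation and multigrid methods mentioned above.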

The error and convergence analysis is initiated by first establishing the 
following two lemmas. 

Lemma 11.6 Denote by $A$ the matrix of the finite difference method for $q \geq 0$ and by $A_0$ the corresponding matrix for $q = 0$. Then

$$0 \leq A^{-1} \leq A_0^{-1};$$

i.e., all components of $A^{-1}$ are nonnegative and smaller than or equal to the corresponding components of $A_0^{-1}$.

Proof. The columns of the inverse $A^{-1} = (a_1, \ldots, a_n)$ satisfy $Aa_j = e_j$ for $j = 1, \ldots, n$ with the canonical unit vectors $e_1, \ldots, e_n$ in $\mathbb{R}^n$. The Jacobi iterations for the solution of $Az = e_j$ starting with $z_0 = 0$ are given by

$$z_{\nu+1} = -D^{-1}(A_L + A_R)z_\nu + D^{-1}e_j, \quad \nu = 0, 1, \ldots,$$

with the usual splitting $A = D + A_L + A_R$ of $A$ into its diagonal, lower, and upper triangular parts. Since the entries of $D^{-1}$ and of $-D^{-1}(A_L + A_R)$ are all nonnegative, it follows that $A^{-1} \geq 0$. Analogously, the iterations

$$z_{\nu+1} = -D_0^{-1}(A_L + A_R)z_\nu + D_0^{-1}e_j, \quad \nu = 0, 1, \ldots,$$

yield the columns of $A_0^{-1}$. Therefore, from $D_0^{-1} \geq D^{-1}$ we conclude that $A_0^{-1} \geq A^{-1}$. □
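
The componentwise inequality of Lemma 11.6 can be checked numerically for a small example. The sketch below computes the inverses column by column with a direct tridiagonal solver (the Thomas algorithm, used here for convenience instead of the Jacobi iterations of the proof). The helper names and the sample values of $q$ are assumptions made for illustration.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm; lower[0] and upper[-1] are not used."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[n - 1] = d[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def fd_inverse(qvals, h):
    """Columns of the inverse of the finite difference matrix A for nodal values q_j."""
    n = len(qvals)
    off = [-1.0 / (h * h)] * n
    diag = [(2.0 + h * h * qj) / (h * h) for qj in qvals]
    columns = []
    for j in range(n):
        e_j = [1.0 if i == j else 0.0 for i in range(n)]
        columns.append(solve_tridiagonal(off, diag, off, e_j))
    return columns


n = 6
h = 1.0 / (n + 1)
inv_q = fd_inverse([1.0 + j for j in range(n)], h)   # arbitrary sample values q_j >= 0
inv_0 = fd_inverse([0.0] * n, h)                      # the case q = 0
ok = all(0.0 <= inv_q[j][i] <= inv_0[j][i] for j in range(n) for i in range(n))
print("0 <= A^{-1} <= A_0^{-1} componentwise:", ok)
```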

Lemma 11.7 Assume that $u \in C^4[a,b]$. Then

$$\left| u''(x) - \frac{1}{h^2}\,[u(x+h) - 2u(x) + u(x-h)] \right| \leq \frac{h^2}{12}\,\|u^{(4)}\|_\infty$$

for all $x \in [a+h, b-h]$.

Proof. By Taylor's formula we have that

$$u(x \pm h) = u(x) \pm h\,u'(x) + \frac{h^2}{2}\,u''(x) \pm \frac{h^3}{6}\,u'''(x) + \frac{h^4}{24}\,u^{(4)}(x \pm \theta_\pm h)$$

for some $\theta_\pm \in (0,1)$. Adding these two equations gives

$$u(x+h) - 2u(x) + u(x-h) = h^2 u''(x) + \frac{h^4}{24}\,u^{(4)}(x + \theta_+ h) + \frac{h^4}{24}\,u^{(4)}(x - \theta_- h),$$

whence the statement of the lemma follows. □
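
The bound of Lemma 11.7 is easy to observe numerically. The following sketch uses $u = \sin$, for which $\|u^{(4)}\|_\infty \leq 1$, so that the defect of the second difference quotient should stay below $h^2/12$ and shrink by a factor of about four when $h$ is halved. The function names and the evaluation point $x = 0.3$ are assumptions chosen for the example.

```python
import math


def second_difference(u, x, h):
    """Central second difference quotient [u(x+h) - 2u(x) + u(x-h)] / h^2."""
    return (u(x + h) - 2.0 * u(x) + u(x - h)) / (h * h)


def defect(h, x=0.3):
    """|u''(x) - second difference| for u = sin, where u'' = -sin."""
    return abs(-math.sin(x) - second_difference(math.sin, x, h))


for h in (0.1, 0.05, 0.025):
    print(h, defect(h), h * h / 12.0)   # defect next to the bound of the lemma
```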

Theorem 11.8 Assume that the solution to the boundary value problem (11.7)-(11.8) is four-times continuously differentiable. Then the error of the finite difference approximation can be estimated by

$$|u(x_j) - u_j| \leq \frac{h^2}{96}\,\|u^{(4)}\|_\infty\,(b-a)^2, \quad j = 1, \ldots, n.$$

Proof. By Lemma 11.7, for

$$z_j := u''(x_j) - \frac{1}{h^2}\,[u(x_{j+1}) - 2u(x_j) + u(x_{j-1})]$$

we have the estimate

$$|z_j| \leq \frac{h^2}{12}\,\|u^{(4)}\|_\infty, \quad j = 1, \ldots, n. \tag{11.13}$$

Since

$$-\frac{1}{h^2}\,[u(x_{j+1}) - (2 + h^2 q_j)u(x_j) + u(x_{j-1})] = -u''(x_j) + q_j u(x_j) + z_j = r_j + z_j,$$

the vector $\tilde U = (u(x_1), \ldots, u(x_n))^T$ given by the exact solution solves the linear system

$$A\tilde U = R + Z,$$

where $Z = (z_1, \ldots, z_n)^T$. Therefore,

$$A(\tilde U - U) = Z,$$

and from this, using Lemma 11.6 and the estimate (11.13), we obtain

$$|u(x_j) - u_j| \leq \|A_0^{-1}e\|_\infty\,\|Z\|_\infty \leq \frac{h^2}{12}\,\|u^{(4)}\|_\infty\,\|A_0^{-1}e\|_\infty, \quad j = 1, \ldots, n, \tag{11.14}$$

where $e = (1, \ldots, 1)^T$. The boundary value problem

$$-u_0'' = 1, \quad u_0(a) = u_0(b) = 0,$$

has the solution

$$u_0(x) = \frac{1}{2}\,(x-a)(b-x).$$

Since $u_0^{(4)} = 0$, in this case, as a consequence of (11.14) the finite difference approximation coincides with the exact solution; i.e., $e = A_0 U_0 = A_0 \tilde U_0$. Hence,

$$\|A_0^{-1}e\|_\infty \leq \|u_0\|_\infty = \frac{1}{8}\,(b-a)^2.$$

Inserting this into (11.14) completes the proof. □

Theorem 11.8 confirms that as in the case of the initial value problems 
in Chapter 10, the order of the local discretization error is inherited by 
the global error. Note that the assumption in Theorem 11.8 on the dif- 
ferentiability of the solution is satisfied if q and r are twice continuously 
differentiable. 

The error estimate in Theorem 11.8 is not practical in general, since it 
requires a bound on the fourth derivative of the unknown exact solution. 
Therefore, in practice, analogously to (10.19) the error is estimated from 
the numerical results for step sizes h and h/2. Similarly, as in (10.20), a 
Richardson extrapolation can be employed to obtain a fourth-order approx- 
imation. 
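
The comparison of results for step sizes $h$ and $h/2$ can be sketched as follows. Since (10.19) and (10.20) are not reproduced in this section, the code assumes the usual form for a second-order method: the error of the $h/2$ solution is estimated by $(u_{h/2} - u_h)/3$, and the extrapolated value is $(4u_{h/2} - u_h)/3$. The test problem $-u'' + u = (\pi^2 + 1)\sin \pi x$ on $[0,1]$ with exact solution $\sin \pi x$, and all function names, are assumptions chosen for illustration.

```python
import math


def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm; lower[0] and upper[-1] are not used."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[n - 1] = d[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def fd_solve(q, r, a, b, n):
    """Solve the system (11.12) on n interior grid points."""
    h = (b - a) / (n + 1)
    xs = [a + j * h for j in range(1, n + 1)]
    off = [-1.0 / (h * h)] * n
    diag = [(2.0 + h * h * q(x)) / (h * h) for x in xs]
    return xs, solve_tridiagonal(off, diag, off, [r(x) for x in xs])


def q(x):
    return 1.0


def r(x):
    return (math.pi ** 2 + 1.0) * math.sin(math.pi * x)


def exact(x):
    return math.sin(math.pi * x)


n = 9
xs, u_h = fd_solve(q, r, 0.0, 1.0, n)               # step size h
_, u_fine = fd_solve(q, r, 0.0, 1.0, 2 * n + 1)     # step size h/2
u_h2 = [u_fine[2 * i + 1] for i in range(n)]        # h/2 solution at the coarse nodes

estimate = max(abs(v - w) / 3.0 for v, w in zip(u_h2, u_h))
extrapolated = [(4.0 * v - w) / 3.0 for v, w in zip(u_h2, u_h)]
err_h2 = max(abs(v - exact(x)) for v, x in zip(u_h2, xs))
err_extrapolated = max(abs(v - exact(x)) for v, x in zip(extrapolated, xs))
print("estimated error:", estimate, "true error:", err_h2)
print("error after extrapolation:", err_extrapolated)
```

The estimate needs no knowledge of the exact solution, which is the point of the procedure; the exact solution appears here only to check the estimate.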

Of course, the finite difference approximation can be extended to the general linear ordinary differential equation of second order

$$-u'' + pu' + qu = r$$

by using the approximation

$$u'(x_j) \approx \frac{1}{2h}\,[u(x_{j+1}) - u(x_{j-1})] \tag{11.15}$$

for the first derivative. This approximation again has an error of order $O(h^2)$ (see Problem 11.9). Besides Richardson extrapolation, higher-order approximations can be obtained by using higher-order difference approximations for the derivatives such as

$$u''(x) \approx \frac{1}{12h^2}\,[-u(x+2h) + 16u(x+h) - 30u(x) + 16u(x-h) - u(x-2h)], \tag{11.16}$$

which is of order $O(h^4)$, provided that $u$ is six-times continuously differentiable (see Problem 11.9).
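
The fourth-order behavior of (11.16) can be observed with a few lines of Python: halving $h$ should reduce the defect by a factor close to $2^4 = 16$. The choice $u = \sin$, the evaluation point, and the function names are assumptions made for this check.

```python
import math


def second_derivative_4th_order(u, x, h):
    """Five-point approximation (11.16) of u''(x)."""
    return (
        -u(x + 2 * h) + 16 * u(x + h) - 30 * u(x) + 16 * u(x - h) - u(x - 2 * h)
    ) / (12 * h * h)


def defect4(h, x=0.3):
    """|u''(x) - approximation| for u = sin, where u'' = -sin."""
    return abs(-math.sin(x) - second_derivative_4th_order(math.sin, x, h))


ratio = defect4(0.2) / defect4(0.1)
print("defect ratio for h = 0.2 versus h = 0.1:", ratio)
```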
We wish also to indicate briefly how the finite difference approximations are applied to boundary value problems for partial differential equations. For this we consider the boundary value problem for

$$-\Delta u + qu = r \quad \text{in } D \tag{11.17}$$

in the unit square $D = (0,1) \times (0,1)$ with boundary condition

$$u = 0 \quad \text{on } \partial D. \tag{11.18}$$

Here $\Delta$ denotes the Laplacian

$$\Delta u := \frac{\partial^2 u}{\partial x_1^2} + \frac{\partial^2 u}{\partial x_2^2}.$$

Proceeding as in the proof of Theorem 11.4, by partial integration it can be seen that under the assumption $q \geq 0$ this boundary value problem has at most one solution. It is more involved and beyond the scope of this book to establish that a solution exists under proper assumptions on the functions $q$ and $r$. We refer to [24, 60] and also the remarks at the end of Section 11.4.

As in Example 2.2, we choose an equidistant grid

$$x_{ij} = (ih, jh), \quad i, j = 0, \ldots, n+1,$$

with step size $h = 1/(n+1)$ and $n \in \mathbb{N}$. Then we approximate the Laplacian at the internal grid points by

$$\Delta u(x_{ij}) \approx \frac{1}{h^2}\,\{u(x_{i+1,j}) + u(x_{i-1,j}) + u(x_{i,j+1}) + u(x_{i,j-1}) - 4u(x_{ij})\}$$

and obtain the system of equations

$$\frac{1}{h^2}\,[(4 + h^2 q_{ij})u_{ij} - u_{i+1,j} - u_{i-1,j} - u_{i,j+1} - u_{i,j-1}] = r_{ij}, \quad i, j = 1, \ldots, n, \tag{11.19}$$

for approximate values $u_{ij}$ to the exact solution $u(x_{ij})$. Here we have set $q_{ij} := q(x_{ij})$ and $r_{ij} := r(x_{ij})$. This system has to be complemented by the boundary conditions

$$u_{0,j} = u_{n+1,j} = 0, \quad j = 0, \ldots, n+1, \qquad u_{i,0} = u_{i,n+1} = 0, \quad i = 1, \ldots, n. \tag{11.20}$$



We refrain from rewriting the system (11.19)-(11.20) in matrix notation and refer back to Example 2.2. Analogously to Theorem 11.5, it can be seen that the Jacobi iterations converge (and relaxation methods and multigrid methods are applicable). Hence we have the following theorem.

Theorem 11.9 For each $h > 0$ the difference equations (11.19)-(11.20) have a unique solution.
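
A minimal sketch of the Jacobi iteration for the five-point system (11.19)-(11.20) is given below. It is run for the sample data $q = 0$, $r = 1$, i.e., for the discrete counterpart of the problem $-\Delta u_0 = 1$ with zero boundary values. The function name `jacobi_2d`, the grid size, and the number of sweeps are assumptions made for this example.

```python
def jacobi_2d(q, r, n, sweeps):
    """Jacobi iteration for the difference equations (11.19) on the unit square."""
    h = 1.0 / (n + 1)
    # u[i][j] approximates u(ih, jh); the boundary values (11.20) remain zero.
    u = [[0.0] * (n + 2) for _ in range(n + 2)]
    for _ in range(sweeps):
        new = [row[:] for row in u]
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                neighbours = u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1]
                # Solve equation (i, j) of (11.19) for u_ij, using the previous iterate.
                new[i][j] = (h * h * r(i * h, j * h) + neighbours) / (
                    4.0 + h * h * q(i * h, j * h)
                )
        u = new
    return u


# Hypothetical sample problem: q = 0 and r = 1.
u0 = jacobi_2d(lambda x1, x2: 0.0, lambda x1, x2: 1.0, n=7, sweeps=500)
peak = max(max(row) for row in u0)
print("maximum of the discrete solution:", peak)
```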

From the proof of Lemma 11.6 it can be seen that its statement also holds for the corresponding matrices of the system (11.19)-(11.20). Lemma 11.7 implies that

$$\left| \Delta u(x_1,x_2) - \frac{1}{h^2}\,[u(x_1+h,x_2) + u(x_1-h,x_2) + u(x_1,x_2+h) + u(x_1,x_2-h) - 4u(x_1,x_2)] \right| \leq \frac{h^2}{12} \left[ \left\| \frac{\partial^4 u}{\partial x_1^4} \right\|_\infty + \left\| \frac{\partial^4 u}{\partial x_2^4} \right\|_\infty \right],$$

provided that $u \in C^4([0,1] \times [0,1])$. Then we can proceed as in the proof of Theorem 11.8 to derive an error estimate. For this we need to have an estimate on the solution of

$$-\Delta u_0 = 1 \quad \text{in } D, \qquad u_0 = 0 \quad \text{on } \partial D. \tag{11.21}$$

Either from an explicit form of the solution obtained by separation of variables or by writing

$$u_0(x) = \frac{1}{4}\,(1 - x_1)x_1 + \frac{1}{4}\,(1 - x_2)x_2 + v_0(x),$$

where $v_0$ is a harmonic function, i.e., a solution of $\Delta v_0 = 0$, and employing the maximum-minimum principle for harmonic functions (see [39]), it can be seen that $\|u_0\|_\infty \leq 1/8$ (see Problem 11.10). Hence we can state the following theorem.



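Theorem 11.10 Assume that the solution to the boundary value problem (11.17)-(11.18) is four-times continuously differentiable. Then the error of the finite difference approximation can be estimated by

$$|u(x_{ij}) - u_{ij}| \leq \frac{h^2}{96} \left[ \left\| \frac{\partial^4 u}{\partial x_1^4} \right\|_\infty + \left\| \frac{\partial^4 u}{\partial x_2^4} \right\|_\infty \right], \quad i, j = 1, \ldots, n.$$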



11.3 The Riesz and Lax-Milgram Theorems 

To establish the foundation of finite element methods for boundary value 
problems we need to extend our tools from functional analysis. 

Theorem 11.11 (Riesz) Let $X$ be a Hilbert space. Then for each bounded linear function $F : X \to \mathbb{C}$ there exists a unique element $f \in X$ such that

$$F(u) = (u, f) \tag{11.22}$$

for all $u \in X$. The norms of the element $f$ and the linear function $F$ coincide; i.e.,

$$\|f\| = \|F\|. \tag{11.23}$$

Proof. Uniqueness follows from the observation that because of the positive definiteness of the scalar product, $f = 0$ is the only element representing the zero function $F = 0$ in the sense of (11.22). For $F \neq 0$ choose $w \in X$ with $F(w) \neq 0$. Since $F$ is continuous, the nullspace

$$N(F) = \{u \in X : F(u) = 0\}$$

can be seen to be a closed, and consequently, by Remark 3.40, a complete, subspace of the Hilbert space $X$. By the approximation Theorem 3.52 there exists the best approximation $v$ to $w$ with respect to $N(F)$. By Theorem 3.51 it satisfies $w - v \perp N(F)$. Then for $g := w - v$ we have that

$$(F(g)u - F(u)g, g) = 0, \quad u \in X,$$

since $F(g)u - F(u)g \in N(F)$ for all $u \in X$. Hence,

$$F(u) = \left( u, \frac{\overline{F(g)}}{\|g\|^2}\, g \right)$$

for all $u \in X$, which completes the proof of (11.22).

From (11.22) and the Cauchy-Schwarz inequality we have that

$$|F(u)| \leq \|f\|\,\|u\|, \quad u \in X,$$

whence $\|F\| \leq \|f\|$ follows. On the other hand, inserting $f$ into (11.22) yields

$$\|f\|^2 = F(f) \leq \|F\|\,\|f\|,$$

and therefore $\|f\| \leq \|F\|$. This concludes the proof of the norm equality (11.23). □

Definition 11.12 A linear operator $A : X \to X$ in a pre-Hilbert space $X$ is called strictly coercive if there exists a constant $c > 0$ such that

$$\operatorname{Re}(Au, u) \geq c\,\|u\|^2 \tag{11.24}$$

for all $u \in X$.

Theorem 11.13 (Lax-Milgram) In a Hilbert space $X$ a bounded and strictly coercive linear operator $A : X \to X$ has a bounded inverse $A^{-1} : X \to X$.

Proof. Using the Cauchy-Schwarz inequality, we can estimate

$$\|Au\|\,\|u\| \geq \operatorname{Re}(Au, u) \geq c\,\|u\|^2.$$

Hence

$$\|Au\| \geq c\,\|u\| \tag{11.25}$$

for all $u \in X$. From (11.25) we observe that $Au = 0$ implies $u = 0$; i.e., $A$ is injective.

Next we show that the range $A(X)$ is closed. Let $v$ be an element of the closure $\overline{A(X)}$ and let $(v_n)$ be a sequence from $A(X)$ with $v_n \to v$, $n \to \infty$. Then we can write $v_n = Au_n$ with some $u_n \in X$, and from (11.25) we find that

$$c\,\|u_n - u_m\| \leq \|v_n - v_m\|$$

for all $n, m \in \mathbb{N}$. Therefore, $(u_n)$ is a Cauchy sequence in $X$ and converges: $u_n \to u$, $n \to \infty$, with some $u \in X$. Then $v = Au$, since $A$ is continuous, and $\overline{A(X)} = A(X)$ is proven.

From Remark 3.40 we now have that $A(X)$ is complete. Let $w \in X$ be arbitrary and denote by $v$ its best approximation with respect to $A(X)$, which uniquely exists by Theorem 3.52. Then, by Theorem 3.51, we have $(w - v, u) = 0$ for all $u \in A(X)$. In particular, $(w - v, A(w - v)) = 0$. Hence, from (11.24) we see that $w = v \in A(X)$. Therefore, $A$ is surjective. Finally, the boundedness of the inverse

$$\|A^{-1}\| \leq \frac{1}{c} \tag{11.26}$$

is a consequence of (11.25). □

Definition 11.14 Let $X$ be a complex (or real) linear space. Then a function $S : X \times X \to \mathbb{C}$ (or $\mathbb{R}$) is called sesquilinear if it is linear with respect to the first variable and antilinear with respect to the second variable, i.e., if

$$S(\alpha u + \beta v, w) = \alpha S(u, w) + \beta S(v, w)$$

and

$$S(u, \alpha v + \beta w) = \bar\alpha S(u, v) + \bar\beta S(u, w)$$

for all $u, v, w \in X$ and $\alpha, \beta \in \mathbb{C}$ (or $\mathbb{R}$). A sesquilinear function on a normed space $X$ is called bounded if

$$|S(u, v)| \leq C\,\|u\|\,\|v\|$$

for all $u, v \in X$ and some positive constant $C$. It is called strictly coercive if

$$\operatorname{Re} S(u, u) \geq c\,\|u\|^2$$

for all $u \in X$ and some positive constant $c$.



Note that for a real linear space, sesquilinear functions are bilinear functions, i.e., linear with respect to both variables. Each bounded and strictly coercive linear operator $A : X \to X$ in a pre-Hilbert space defines a bounded and strictly coercive sesquilinear function by

$$S(u, v) := (u, Av), \quad u, v \in X.$$

The converse of this statement is described by the following theorem.

Theorem 11.15 Let $S$ be a bounded and strictly coercive sesquilinear function on a Hilbert space $X$. Then there exists a uniquely determined bounded and strictly coercive linear operator $A : X \to X$ such that

$$S(u, v) = (u, Av)$$

for all $u, v \in X$.

Proof. For each $v \in X$ the mapping $u \mapsto S(u, v)$ clearly defines a bounded linear function on $X$, since $|S(u, v)| \leq C\,\|u\|\,\|v\|$. By the Riesz Theorem 11.11 we can write $S(u, v) = (u, f)$ for all $u \in X$ and some $f \in X$. Therefore, setting $Av := f$ we define an operator $A : X \to X$ such that $S(u, v) = (u, Av)$ for all $u, v \in X$.

To show that $A$ is linear we observe that

$$(u, \alpha Av + \beta Aw) = \bar\alpha (u, Av) + \bar\beta (u, Aw) = \bar\alpha S(u, v) + \bar\beta S(u, w) = S(u, \alpha v + \beta w) = (u, A[\alpha v + \beta w])$$

for all $u, v, w \in X$ and all $\alpha, \beta \in \mathbb{C}$. The boundedness of $A$ follows from

$$\|Au\|^2 = (Au, Au) = S(Au, u) \leq C\,\|Au\|\,\|u\|,$$

and the strict coercivity of $A$ is a consequence of the strict coercivity of $S$.

To show uniqueness of the operator $A$ we suppose that there exist two operators $A_1$ and $A_2$ with the property

$$S(u, v) = (u, A_1 v) = (u, A_2 v)$$

for all $u, v \in X$. Then we have $(u, A_1 v - A_2 v) = 0$ for all $u, v \in X$, which implies $A_1 v = A_2 v$ for all $v \in X$ by setting $u = A_1 v - A_2 v$. □

Corollary 11.16 Let $S$ be a bounded and strictly coercive sesquilinear function and $F$ a bounded linear function on a Hilbert space $X$. Then there exists a unique $u \in X$ such that

$$S(v, u) = F(v) \tag{11.27}$$

for all $v \in X$.

Proof. By Theorem 11.15 there exists a uniquely determined bounded and strictly coercive linear operator $A$ such that

$$S(v, u) = (v, Au)$$

for all $u, v \in X$, and by Theorem 11.11 there exists a uniquely determined element $f$ such that

$$F(v) = (v, f)$$

for all $v \in X$. Hence, the equation (11.27) is equivalent to the equation

$$Au = f.$$

However, the latter equation is uniquely solvable as a consequence of the Lax-Milgram Theorem 11.13. □

Since the coercivity constants for $A$ and $S$ coincide, from (11.23) and (11.26) we conclude that

$$\|u\| \leq \frac{1}{c}\,\|F\| \tag{11.28}$$

for the unique solution $u$ of (11.27).

Let $A : X \to X$ be a bounded linear operator. Then, given $f \in X$, solving the equation $Au = f$ obviously is equivalent to finding $u \in X$ such that

$$(v, Au) = (v, f) \tag{11.29}$$

for all $v \in X$. The Galerkin method, named after the Russian engineer Galerkin, is based on this observation, and given a finite-dimensional subspace $X_n \subset X$, it approximately solves (11.29) by an element $u_n \in X_n$ such that

$$(v, Au_n) = (v, f) \tag{11.30}$$

for all $v \in X_n$. By Theorems 3.51 and 3.52, the condition (11.30) is equivalent to the fact that the best approximations to $Au_n$ and to $f$ with respect to $X_n$ coincide; i.e.,

$$P_n A u_n = P_n f, \tag{11.31}$$

where $P_n$ denotes the orthogonal projection operator from $X$ onto $X_n$. The equivalence of (11.30) and (11.31) is the reason why the Galerkin method belongs to the so-called projection methods; i.e., the equation to be approximated is projected onto a finite-dimensional subspace.

To analyze the Galerkin method we introduce a finite-dimensional operator $A_n : X_n \to X_n$ by $A_n := P_n A$. Then, by Theorem 3.51, we have

$$(A_n u, u) = (P_n A u, u) = (Au, u) + (P_n A u - Au, u) = (Au, u)$$

for all $u \in X_n$. Hence from the strict coercivity of $A$ we deduce that

$$\operatorname{Re}(A_n u, u) \geq c\,\|u\|^2$$

for all $u \in X_n$; i.e., $A_n : X_n \to X_n$ is strictly coercive with the same coercivity constant $c$ as $A : X \to X$. This now can be employed to prove the following theorem.

Theorem 11.17 For a bounded and strictly coercive linear operator $A$ the Galerkin equations (11.30) have a unique solution. It satisfies the error estimate

$$\|u_n - u\| \leq M \inf_{v \in X_n} \|v - u\|, \tag{11.32}$$

where $M$ is some constant depending on $A$ (and not on $X_n$).

Proof. Since $A_n : X_n \to X_n$ is strictly coercive with coercivity constant $c$, by the Lax-Milgram Theorem 11.13 we conclude that $A_n$ is bijective; i.e., the Galerkin equations (11.30) have a unique solution $u_n \in X_n$. The estimate (11.26) applied to the operator $A_n$ implies that

$$\|A_n^{-1}\| \leq \frac{1}{c}. \tag{11.33}$$

For the error $u_n - u$ between the Galerkin approximation $u_n$ and the exact solution $u$ we can write

$$u_n - u = (A_n^{-1} P_n A - I)u = (A_n^{-1} P_n A - I)(u - v)$$

for all $v \in X_n$, since, trivially, we have $A_n^{-1} P_n A v = v$ for $v \in X_n$. By Theorem 3.52 we have $\|P_n\| = 1$, and therefore, using Remark 3.25 and (11.33) we can estimate

$$\|A_n^{-1} P_n A\| \leq \frac{1}{c}\,\|A\|,$$

whence (11.32) follows. □

The error estimate of Theorem 11.17 is usually referred to as Céa's lemma, since it was first obtained by Céa in 1964. It indicates that the error in the Galerkin method is determined by how well the exact solution can be approximated by elements of the subspace $X_n$.

By Corollary 11.16 the Galerkin method immediately carries over to the solution of the sesquilinear equation (11.27) and consists in finding $u_n \in X_n$ such that

$$S(v, u_n) = F(v) \tag{11.34}$$

for all $v \in X_n$.

The practical solution of the Galerkin equations (11.30) reduces to the solution of a system of linear equations. If $w_1, \ldots, w_n$ is a basis for $X_n$ (without loss of generality we assume the dimension of $X_n$ to be $n$), then for

$$u_n = \sum_{k=1}^n \alpha_k w_k$$

the Galerkin equations (11.30) are equivalent to the system of linear equations

$$\sum_{k=1}^n \alpha_k (w_j, A w_k) = (w_j, f), \quad j = 1, \ldots, n. \tag{11.35}$$

From this formulation it becomes obvious that the Galerkin method is only a semidiscrete method, since setting up the linear system requires the evaluation of scalar products and of the operator $A$ applied to the basis elements. For a fully discrete method these computations, in general, need further approximations of integrals for the scalar products and of differential or integral operators. This also requires that the error analysis be amended accordingly, since the error estimate of Theorem 11.17 covers only the semidiscrete case.
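
To make the linear system of the Galerkin method concrete, the following sketch treats the simplest possible setting: $X = \mathbb{R}^3$ with the Euclidean scalar product, $A$ a matrix, and $X_n$ the span of two basis vectors. In this finite-dimensional real case all scalar products can be evaluated exactly, so no further approximation is needed. The matrix, the right-hand side, the basis, and all function names are assumptions chosen for illustration. The sample matrix satisfies $(Au, u) = 2\|u\|^2$ and hence is strictly coercive.

```python
def dot(u, v):
    """Euclidean scalar product."""
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(A, u):
    return [dot(row, u) for row in A]


def solve_dense(G, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(G, b)]
    for k in range(n):
        piv = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[piv] = M[piv], M[k]
        for i in range(k + 1, n):
            factor = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= factor * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def galerkin(A, f, basis):
    """Solve (w_j, A u_n) = (w_j, f) for u_n in span(basis); real case only."""
    images = [matvec(A, w) for w in basis]
    G = [[dot(wj, Awk) for Awk in images] for wj in basis]   # entries (w_j, A w_k)
    rhs = [dot(wj, f) for wj in basis]                       # entries (w_j, f)
    alpha = solve_dense(G, rhs)
    return [sum(alpha[k] * basis[k][i] for k in range(len(basis))) for i in range(len(f))]


A = [[2.0, 1.0, 0.0], [-1.0, 2.0, 1.0], [0.0, -1.0, 2.0]]   # (Au, u) = 2 |u|^2
f = [1.0, 2.0, 3.0]
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
u_n = galerkin(A, f, basis)
defect = [x - y for x, y in zip(matvec(A, u_n), f)]
print("Galerkin solution:", u_n)
print("defect tested against the basis:", [dot(w, defect) for w in basis])
```

The defect $Au_n - f$ does not vanish, but it is orthogonal to the subspace, which is exactly the projection property expressed by (11.31).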

Having outlined the basic ideas of the Galerkin method and its error analysis within a few paragraphs, we want to point out clearly that the power and the art of the application of the Galerkin method for the approximate solution of differential and integral equations begins with the proper choice of the approximating subspace $X_n$ and the appropriate basis $w_1, \ldots, w_n$ therein, corresponding to the operator $A$ under consideration. However, it is beyond our goal to enter into this important topic in any detail aside from the short discussion in Section 11.5.

We return to the boundary value problem, and instead of (11.7)-(11.8) we consider the slightly more general so-called Sturm-Liouville problem

$$-(pu')' + qu = r \quad \text{in } [a,b] \tag{11.36}$$

with homogeneous boundary conditions

$$u(a) = u(b) = 0. \tag{11.37}$$

Here we assume that $p \in C^1[a,b]$ and $q, r \in C[a,b]$ such that $p(x) > 0$ and $q(x) \geq 0$ for all $x \in [a,b]$. Multiplying the differential equation by $v$ and performing a partial integration, it follows that each solution $u$ to (11.36)-(11.37) satisfies

$$S(v, u) = F(v) \tag{11.38}$$

for all $v \in C^1[a,b]$ with $v(a) = v(b) = 0$, where we have set

$$S(u, v) := \int_a^b (pu'v' + quv)\,dx \tag{11.39}$$

and

$$F(v) := \int_a^b rv\,dx. \tag{11.40}$$

Conversely, if $u \in C^2[a,b]$ satisfies the boundary conditions (11.37) and the equation (11.38) for all such $v$, then by partial integration we obtain

$$\int_a^b [(pu')' - qu + r]\,v\,dx = 0 \tag{11.41}$$

for all $v \in C^1[a,b]$ with $v(a) = v(b) = 0$. Now we set $f := (pu')' - qu + r$ and assume that $f(x_0) \neq 0$ for some $x_0$ in $(a,b)$, say $f(x_0) > 0$. Since $f$ is continuous, there exists an interval $U \subset (a,b)$ such that $f$ is positive on $U$. Now we choose a nonnegative function $v \neq 0$ from $C^1[a,b]$ which vanishes outside $U$. For this function $v$ the integral in (11.41) must be positive. This is a contradiction, and therefore $f$ must vanish identically; i.e., $u$ satisfies the differential equation (11.36). Therefore, (11.38) provides an equivalent reformulation of the boundary value problem.

From Example 3.38 we recall that the space of continuous functions is not complete with respect to the $L^2$ scalar product. However, if we wish to apply the analysis of the previous section and, in particular, Corollary 11.16, then we need a Hilbert space. For this, we introduce the Sobolev space $H^1[a,b]$ based on the concept of weak derivatives. By $L^2[a,b]$ we denote the space of measurable real-valued functions defined on the interval $[a,b]$ that are square-integrable in the sense of Lebesgue. We shall make use of the fact that $L^2[a,b]$ is a Hilbert space with respect to the $L^2$ scalar product. (More precisely, $L^2[a,b]$ is the linear space of equivalence classes of functions coinciding almost everywhere.) Note that the space $C[a,b]$ of continuous functions is dense in $L^2[a,b]$ (see [5, 51, 59]).

Definition 11.18 A function $u \in L^2[a,b]$ is said to have a weak derivative $u' \in L^2[a,b]$ if

$$\int_a^b uv'\,dx = -\int_a^b u'v\,dx \tag{11.42}$$

for all $v \in C^1[a,b]$ with $v(a) = v(b) = 0$.



By partial integration it follows that (11.42) is satisfied for functions $u \in C^1[a,b]$. Hence, weak differentiability generalizes classical differentiability.

From the denseness of $\{v \in C^1[a,b] : v(a) = v(b) = 0\}$ in $L^2[a,b]$, or from the Fourier series for the odd extension of $u$, it can be seen that the weak derivative, if it exists, is unique (see Problem 11.17). From the denseness of $C[a,b]$ in $L^2[a,b]$, or from the Fourier series for the even extension of $u$, it follows that each function with vanishing weak derivative must be constant almost everywhere (see Problem 11.17). The latter, in particular, implies

$$u(x) = \int_a^x u'(\xi)\,d\xi + c \tag{11.43}$$

for almost all $x \in [a,b]$ and some constant $c$, since by Fubini's theorem

$$\int_a^b \left( \int_a^x u'(\xi)\,d\xi \right) v'(x)\,dx = \int_a^b u'(\xi) \left( \int_\xi^b v'(x)\,dx \right) d\xi = -\int_a^b u'(\xi)v(\xi)\,d\xi$$

for all $v \in C^1[a,b]$ with $v(a) = v(b) = 0$. Hence both sides of (11.43) have the same weak derivative.

Theorem 11.19 The linear space

$$H^1[a,b] := \{u \in L^2[a,b] : u' \in L^2[a,b]\}$$

endowed with the scalar product

$$(u, v)_{H^1} := \int_a^b (uv + u'v')\,dx \tag{11.44}$$

is a Hilbert space.

Proof. It is readily checked that $H^1[a,b]$ is a linear space and that (11.44) defines a scalar product. Let $(u_n)$ denote an $H^1$ Cauchy sequence. Then $(u_n)$ and $(u_n')$ are both $L^2$ Cauchy sequences. From the completeness of $L^2[a,b]$ we obtain the existence of $u \in L^2[a,b]$ and $w \in L^2[a,b]$ such that $\|u_n - u\|_{L^2} \to 0$ and $\|u_n' - w\|_{L^2} \to 0$ as $n \to \infty$. Then for all $v \in C^1[a,b]$ with $v(a) = v(b) = 0$ we can estimate

$$\left| \int_a^b (uv' + wv)\,dx \right| = \left| \int_a^b \{(u - u_n)v' + (w - u_n')v\}\,dx \right| \leq \|u - u_n\|_{L^2}\|v'\|_{L^2} + \|w - u_n'\|_{L^2}\|v\|_{L^2} \to 0, \quad n \to \infty.$$

Therefore, $u \in H^1[a,b]$ with $u' = w$, and $\|u - u_n\|_{H^1} \to 0$, $n \to \infty$, which completes the proof. □

Theorem 11.20 $C^1[a,b]$ is dense in $H^1[a,b]$.

Proof. Since $C[a,b]$ is dense in $L^2[a,b]$, for each $u \in H^1[a,b]$ and $\varepsilon > 0$ there exists $w \in C[a,b]$ such that $\|u' - w\|_{L^2} < \varepsilon$. Then we define $v \in C^1[a,b]$ by

$$v(x) := u(a) + \int_a^x w(\xi)\,d\xi,$$

and using (11.43), we have

$$u(x) - v(x) = \int_a^x \{u'(\xi) - w(\xi)\}\,d\xi.$$

By the Cauchy-Schwarz inequality this implies $\|u - v\|_{L^2} < (b-a)\varepsilon$, and the proof is complete. □

Theorem 11.21 $H^1[a,b]$ is contained in $C[a,b]$.

Proof. From (11.43) we have

$$u(x) - u(y) = \int_y^x u'(\xi)\,d\xi,$$

whence by the Cauchy-Schwarz inequality,

$$|u(x) - u(y)| \leq |x - y|^{1/2}\,\|u'\|_{L^2} \tag{11.45}$$

follows for all $x, y \in [a,b]$. Therefore, every function $u \in H^1[a,b]$ belongs to $C[a,b]$, or more precisely, it coincides almost everywhere with a continuous function. □



By Theorem 11.21 we may consider $H^1[a,b]$ as a subspace of $C[a,b]$. Choose $y \in [a,b]$ such that $|u(y)| = \min_{a \leq x \leq b} |u(x)|$. Then from

$$(b-a) \min_{a \leq x \leq b} |u(x)|^2 \leq \int_a^b |u(x)|^2\,dx$$

and (11.45), by the Cauchy-Schwarz inequality we find that

$$\|u\|_\infty \leq C\,\|u\|_{H^1}$$

for some constant $C$. The latter inequality means that the $H^1$ norm is stronger than the maximum norm (in one space dimension!).

Theorem 11.22 The space

$$H_0^1[a,b] := \{u \in H^1[a,b] : u(a) = u(b) = 0\}$$

is a complete subspace of $H^1[a,b]$.

Proof. Since the $H^1$ norm is stronger than the maximum norm, each $H^1$ convergent sequence of elements of $H_0^1[a,b]$ has its limit in $H_0^1[a,b]$. Therefore $H_0^1[a,b]$ is a closed subspace of $H^1[a,b]$, and the statement follows from Remark 3.40. □

Definition 11.23 A function $u \in H_0^1[a,b]$ is called a weak solution to the boundary value problem (11.36)-(11.37) if (11.38) is satisfied for all $v \in H_0^1[a,b]$.

Theorem 11.24 Assume that $p > 0$ and $q \geq 0$. Then there exists a unique weak solution to the boundary value problem (11.36)-(11.37).

Proof. The sesquilinear function $S : H_0^1[a,b] \times H_0^1[a,b] \to \mathbb{R}$ is bounded, since

$$|S(u, v)| \leq \max\{\|p\|_\infty, \|q\|_\infty\}\,\|u\|_{H^1}\|v\|_{H^1}$$

by the Cauchy-Schwarz inequality. For $u \in H_0^1[a,b]$, from (11.45) and the Cauchy-Schwarz inequality we obtain that

$$\|u\|_{L^2}^2 = \int_a^b |u(x)|^2\,dx \leq (b-a)^2\,\|u'\|_{L^2}^2.$$

Hence we can estimate

$$S(u, u) \geq \min_{a \leq x \leq b} p(x) \int_a^b |u'|^2\,dx \geq c\,\|u\|_{H^1}^2$$

for all $u \in H_0^1[a,b]$ and some positive constant $c$; i.e., $S$ is strictly coercive. Finally, by the Cauchy-Schwarz inequality we have

$$|F(v)| \leq \|r\|_{L^2}\|v\|_{L^2} \leq \|r\|_{L^2}\|v\|_{H^1};$$

i.e., the linear function $F : H_0^1[a,b] \to \mathbb{R}$ is bounded. Now the statement of the theorem follows from Corollary 11.16. □



We note that from (11.28) and the previous inequality it follows that

$$\|u\|_{H^1} \leq \frac{1}{c}\,\|r\|_{L^2} \tag{11.46}$$

for the weak solution $u$ to the boundary value problem (11.36)-(11.37).

Theorem 11.25 Each weak solution to the boundary value problem (11.36)-(11.37) is also a classical solution; i.e., it is twice continuously differentiable.

Proof. Define

$$f(x) := \int_a^x [q(\xi)u(\xi) - r(\xi)]\,d\xi, \quad x \in [a,b].$$

Then $f \in C^1[a,b]$. From (11.38), by partial integration we obtain

$$\int_a^b [pu' - f]\,v'\,dx = 0$$

for all $v \in H_0^1[a,b]$. Now we set

$$c := \frac{1}{b-a} \int_a^b [pu' - f]\,d\xi$$

and

$$v_0(x) := \int_a^x [p(\xi)u'(\xi) - f(\xi) - c]\,d\xi, \quad x \in [a,b].$$

Then $v_0 \in H_0^1[a,b]$ and

$$\int_a^b [pu' - f - c]^2\,dx = \int_a^b [pu' - f - c]\,v_0'\,dx = -c \int_a^b v_0'\,dx = 0.$$

Hence

$$pu' = f + c,$$

and since $f$ and $p$ are in $C^1[a,b]$ with $p(x) > 0$ for all $x \in [a,b]$, we can conclude that $u' \in C^1[a,b]$ and

$$(pu')' = f' = qu - r.$$

This completes the proof. □



Using the differential equation (11.36), from (11.46) we conclude that there exists a constant $C > 0$, independent of $r$, such that

$$\|u''\|_{L^2} \leq C\,\|r\|_{L^2}, \tag{11.47}$$

which we note for later use.

As compared to Theorem 11.4 we have not obtained any major extension of the existence result. However, as pointed out already in the introduction, we view this section as a model case for the more complicated situation of partial differential equations. By partial integration it can be seen that the boundary value problem (11.17)-(11.18) for the Laplace operator is equivalent to finding a function $u \in C^2(D)$ satisfying $u = 0$ on $\partial D$ and

$$\int_D ([\operatorname{grad} u]^T \operatorname{grad} v + quv)\,dx = \int_D rv\,dx \tag{11.48}$$

for all $v \in C^1(D)$ with $v = 0$ on $\partial D$. The analysis of weak solutions of (11.48) follows the same pattern as for the ordinary differential equation (11.36). However, the details are more involved. In particular, since for the multidimensional case the Sobolev space $H^1(D)$ no longer is a subspace of the continuous functions, the formulation of the boundary condition, i.e., the definition of the subspace $H_0^1(D)$, has to be modified, and establishing that weak solutions are also classical solutions is more complicated. For a comprehensive study of weak solutions to boundary value problems for elliptic partial differential equations we refer to [24, 60].



11.5 The Finite Element Method 

The finite element method for the boundary value problem (11.36)-(11.37) consists in the application of the Galerkin method (11.34) to the weak formulation (11.38) by using spline spaces as approximating subspaces. Then, for appropriate basis functions, the matrix $S(w_j, w_k)$ will be sparse; i.e., most of the matrix entries will be zero. Polynomials as approximating subspaces are not suitable, since analogously to Example 5.1, they lead to ill-conditioned linear systems with full matrices.

We consider the case of linear splines. For the equidistant grid

$$x_j = a + jh, \quad j = 0, \ldots, n+1,$$

with step size $h = (b-a)/(n+1)$ and $n \in \mathbb{N}$ we choose for $X_n$ the space of continuous piecewise linear functions; i.e., $X_n$ consists of the functions $u \in C[a,b]$ that satisfy $u(a) = u(b) = 0$ and coincide on each subinterval $[x_{j-1}, x_j]$ with a polynomial in $P_1$ for $j = 1, \ldots, n+1$. The functions in this spline space belong to $H_0^1[a,b]$ with piecewise constant weak derivatives. As basis elements in $X_n$ we take the so-called hat functions

$$w_k(x) = \begin{cases}
\dfrac{1}{h}\,(x - x_{k-1}), & x \in [x_{k-1}, x_k], \\[2mm]
\dfrac{1}{h}\,(x_{k+1} - x), & x \in [x_k, x_{k+1}], \\[2mm]
0, & x \notin [x_{k-1}, x_{k+1}].
\end{cases}$$



Each $u \in X_n$ can be represented in the form

$$u = \sum_{k=1}^n \alpha_k w_k,$$

where $\alpha_k = u(x_k)$, $k = 1, \ldots, n$. Obviously, we have

$$S(w_j, w_k) = \int_a^b \{p\,w_j' w_k' + q\,w_j w_k\}\,dx = 0$$

if $(x_{j-1}, x_{j+1}) \cap (x_{k-1}, x_{k+1}) = \emptyset$, i.e., if $|j - k| \geq 2$. Therefore, the matrix $S(w_j, w_k)$ is tridiagonal. We compute the matrix elements

$$S(w_j, w_j) = \frac{1}{h^2} \int_{x_{j-1}}^{x_{j+1}} p(x)\,dx + \frac{1}{h^2} \int_{x_{j-1}}^{x_j} q(x)(x - x_{j-1})^2\,dx + \frac{1}{h^2} \int_{x_j}^{x_{j+1}} q(x)(x_{j+1} - x)^2\,dx$$

and

$$S(w_j, w_{j+1}) = S(w_{j+1}, w_j) = -\frac{1}{h^2} \int_{x_j}^{x_{j+1}} p(x)\,dx + \frac{1}{h^2} \int_{x_j}^{x_{j+1}} q(x)(x_{j+1} - x)(x - x_j)\,dx,$$

and the right-hand sides

$$F(w_j) = \frac{1}{h} \left\{ \int_{x_{j-1}}^{x_j} r(x)(x - x_{j-1})\,dx + \int_{x_j}^{x_{j+1}} r(x)(x_{j+1} - x)\,dx \right\}.$$

J X j J 

These equations illustrate two general features of the finite element meth- 
ods. Firstly, it is characteristic for the finite element method that the co- 
efficients are computed by the same formula for each subinterval, i.e., for 
each of the finite elements into which the total interval is subdivided. 

Secondly, as already mentioned earlier, the Galerkin method is only 
semidiscrete. In order to make it fully discrete, a numerical quadrature 
has to be applied. If we remain within our framework of approximations 
and approximate p, q , and r by linear splines, we obtain 

S(wj,wj) « — (pj-i + 2 pj +Pj+i) + — (^'-i + 6qj + tfj+i) 

and ^ ^ 

S(wj,w j+ 1 ) « — (jpj+Pj+i) + (9j + qj+ 1 ) 

for the matrix elements, and 



F ( w j) K + 4r i + r i+i) 



for the right-hand sides. Here, as above, we have set $p_j = p(x_j)$, $q_j = q(x_j)$,
and $r_j = r(x_j)$. Similar to the linear system (11.10)-(11.11) for the finite
difference method, the tridiagonal linear system is irreducible and weakly
row-diagonally dominant. It is also accessible to convergence acceleration
of the Jacobi iterations by relaxation and multigrid methods.
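
The fully discrete scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the book's program: it assumes the boundary value problem has the form $-(pu')' + qu = r$ with $u(a) = u(b) = 0$ (consistent with the bilinear form $S$ used above), it assumes an equidistant grid with $n$ interior nodes and $h = (b-a)/(n+1)$, and the function name `fem_linear_splines` is hypothetical. The tridiagonal system is solved by plain elimination without pivoting, which is reasonable here because of the diagonal dominance noted above.

```python
def fem_linear_splines(p, q, r, a, b, n):
    """Sketch: Galerkin equations with hat functions for
    -(p u')' + q u = r, u(a) = u(b) = 0 (assumed problem form).
    Assumed grid: x_j = a + j*h, j = 0, ..., n + 1, h = (b - a)/(n + 1)."""
    h = (b - a) / (n + 1)
    x = [a + j * h for j in range(n + 2)]
    P = [p(t) for t in x]
    Q = [q(t) for t in x]
    R = [r(t) for t in x]

    diag, off, rhs = [], [], []
    for j in range(1, n + 1):
        # S(w_j, w_j) and F(w_j) with p, q, r replaced by linear splines
        diag.append((P[j - 1] + 2 * P[j] + P[j + 1]) / (2 * h)
                    + h * (Q[j - 1] + 6 * Q[j] + Q[j + 1]) / 12)
        rhs.append(h * (R[j - 1] + 4 * R[j] + R[j + 1]) / 6)
        if j < n:
            # S(w_j, w_{j+1}); the matrix is symmetric
            off.append(-(P[j] + P[j + 1]) / (2 * h)
                       + h * (Q[j] + Q[j + 1]) / 12)

    # Forward elimination for the symmetric tridiagonal system
    for i in range(1, n):
        m = off[i - 1] / diag[i - 1]
        diag[i] -= m * off[i - 1]
        rhs[i] -= m * rhs[i - 1]

    # Back substitution
    u = [0.0] * n
    u[n - 1] = rhs[n - 1] / diag[n - 1]
    for i in range(n - 2, -1, -1):
        u[i] = (rhs[i] - off[i] * u[i + 1]) / diag[i]

    # Nodal values including the homogeneous boundary values
    return x, [0.0] + u + [0.0]
```

For instance, `fem_linear_splines(lambda t: 1.0, lambda t: 0.0, lambda t: 1.0, 0.0, 1.0, 7)` returns nodal values that agree with $u(x) = x(1-x)/2$ up to rounding, since for constant $p$ and $q = 0$ the scheme reduces to the standard second difference quotient.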

In order to derive an error estimate for the semidiscrete version of the
finite element method with linear splines from Theorem 11.17, we need an
estimate for the interpolation error for linear splines with respect to the
$H^1$ norm (see also Theorem 8.33).

Lemma 11.26 Let $f \in C^2[a,b]$. Then the remainder $R_1 f := f - L_1 f$
for the linear interpolation at the two endpoints $a$ and $b$ can be estimated
by

\[
\|R_1 f\|_{L^2} \leq (b-a)^2\, \|f''\|_{L^2}, \qquad
\|(R_1 f)'\|_{L^2} \leq (b-a)\, \|f''\|_{L^2}. \tag{11.49}
\]



Proof. For each function $g \in C^1[a,b]$ satisfying $g(a) = 0$, from

\[ g(x) = \int_a^x g'(\xi)\, d\xi, \quad x \in [a,b], \]

by using the Cauchy-Schwarz inequality we obtain

\[ |g(x)|^2 \leq (b-a)\, \|g'\|_{L^2}^2, \quad x \in [a,b]. \]

From this, by integration we derive the Friedrich inequality

\[ \|g\|_{L^2} \leq (b-a)\, \|g'\|_{L^2} \tag{11.50} \]

for functions $g \in C^1[a,b]$ with $g(a) = 0$ (or $g(b) = 0$). Using the interpolation
property $(R_1 f)(a) = (R_1 f)(b) = 0$, by partial integration we obtain

\[ \int_a^b [f' - (L_1 f)']^2\, dx = \int_a^b f''\,(L_1 f - f)\, dx. \]

From this, again applying the Cauchy-Schwarz inequality, we have

\[ \|(R_1 f)'\|_{L^2}^2 \leq \|f''\|_{L^2}\, \|R_1 f\|_{L^2}, \]

whence (11.49) follows with the aid of Friedrich's inequality (11.50) for
$g = R_1 f$. □

Theorem 11.27 The error in the finite element approximation by linear
splines for the boundary value problem (11.36)-(11.37) can be estimated by

\[ \|u_n - u\|_{H^1} \leq C\, \|u''\|_{L^2}\, h \tag{11.51} \]

for some positive constant $C$.

Proof. By summing up the inequalities (11.49), applied to each of the
subintervals of length $h$, for the interpolating linear spline $w_n \in X_n$ with
$w_n(x_j) = u(x_j)$ for $j = 0, \ldots, n$ we find that

\[ \|w_n' - u'\|_{L^2} \leq \|u''\|_{L^2}\, h \]

and

\[ \|w_n - u\|_{L^2} \leq \|u''\|_{L^2}\, h^2, \]

whence

\[ \inf_{v \in X_n} \|v - u\|_{H^1} \leq \|w_n - u\|_{H^1} \leq (1 + b - a)\, \|u''\|_{L^2}\, h \]

follows. Now (11.51) is a consequence of the error estimate for the Galerkin
method of Theorem 11.17. □

By the following trick, which was independently developed by Aubin
(1967) and Nitsche (1968), we can improve the error estimate in the $L^2$
norm to the order $O(h^2)$ that we expect for approximations using linear
splines.

Theorem 11.28 The error in the finite element approximation by linear
splines for the boundary value problem (11.36)-(11.37) can be estimated by

\[ \|u_n - u\|_{L^2} \leq C\, \|u''\|_{L^2}\, h^2 \]

with some positive constant $C$.

Proof. Denote by $z_n$ the weak solution to the boundary value problem with
the right-hand side $u - u_n$; i.e.,

\[ S(v, z_n) = (v, u - u_n)_{L^2} \]

for all $v \in H_0^1[a,b]$. In particular, inserting $v = u - u_n$, it follows that

\[ S(u - u_n, z_n) = \|u - u_n\|_{L^2}^2. \tag{11.52} \]

Since $S(v,u) = F(v)$ and $S(v,u_n) = F(v)$ for all $v \in X_n$, using the symmetry
of $S$ we have

\[ S(u - u_n, v) = 0 \]

for all $v \in X_n$. Inserting the Galerkin approximation to $z_n$, which we denote
by $\tilde z_n$, into the last equation and subtracting from (11.52), we obtain

\[ \|u - u_n\|_{L^2}^2 = S(u - u_n, z_n - \tilde z_n). \tag{11.53} \]

Since $S$ is bounded, from (11.53) and (11.51), applied to $u - u_n$ and $z_n - \tilde z_n$,
we can conclude that

\[ \|u - u_n\|_{L^2}^2 \leq C_1\, \|u''\|_{L^2}\, \|z_n''\|_{L^2}\, h^2 \]

for some constant $C_1$. However, from (11.47) we also have that

\[ \|z_n''\|_{L^2} \leq C_2\, \|u - u_n\|_{L^2} \]

for some constant $C_2$. Now the assertion of the theorem follows from the
last two inequalities. □

We refrain from describing both the extension of this analysis to higher- 
order splines such as cubic splines (see Problem 11.19) and the extension 
to partial differential equations. For the latter we refer to [4, 11]. 



Problems 

11.1 Consider multiple shooting for the boundary value problem

\[ u'' + u = 0, \quad u(a) = u(b) = 0, \]

with $n$ equidistant subintervals. Show that the corresponding linear system (11.6)
is uniquely solvable, provided that $(b - a)/\pi \notin \mathbb{N}$.

11.2 Write a computer program for multiple shooting using the Newton method
and the Runge-Kutta method and test it for various examples.

11.3 Show that the boundary value problem for the differential equation

\[ u'' = f(x, u, u'), \quad a \leq x \leq b, \]

with inhomogeneous boundary conditions $u(a) = \alpha$ and $u(b) = \beta$ can be equivalently
transformed into a boundary value problem with homogeneous boundary
condition.

11.4 Show that the general solution of the linear differential equation (11.7) is
given by (11.9).

11.5 For $p \in C^1[a,b]$ and $q \in C[a,b]$ show that the boundary value problem

\[ u'' + pu' + qu = r \ \text{ in } [a,b], \quad u(a) = u(b) = 0, \]

is solvable for each right-hand side $r \in C[a,b]$ if and only if the boundary value
problem

\[ u'' + pu' + qu = 0 \ \text{ in } [a,b], \quad u(a) = u(b) = 0, \]

admits only the trivial solution $u = 0$.

11.6 Find the solution of the boundary value problem

\[ u''(x) + u(x) = e^x, \quad u(0) = u(1) = 0. \]

11.7 Write a computer program for the finite difference method (11.10)-(11.11)
and test it for various examples.

11.8 Find the explicit solution for the finite difference approximation (11.10)-(11.11)
for the boundary value problem

\[ u'' - u = -2 \ \text{ in } [0,1], \quad u(0) = u(1) = 0, \]

and verify the convergence result of Theorem 11.8.

11.9 Show that the error in the finite difference approximation (11.15) is of
order $O(h^2)$ and that the error in the approximation (11.16) is of order $O(h^4)$.

11.10 Prove the estimate $\|w_0\|_\infty \leq 1/8$ for the solution to the boundary value
problem (11.21).

11.11 In the space $C[a,b]$ with scalar product




define a functional $F : C[a,b] \to \mathbb{C}$ by

\[ F(u) := \int_a^b u(x)\, dx. \]

Show that $F$ is linear and bounded. Is there an $f \in C[a,b]$ such that $F(u) = (u, f)$
for all $u \in C[a,b]$? Does your answer agree with the Riesz Theorem 11.11?

11.12 In the pre-Hilbert space of Problem 11.11, for a fixed $x \in [a,b]$ consider
the point evaluation functional $F : C[a,b] \to \mathbb{C}$ defined by

\[ F(u) := u(x). \]

Is $F$ linear and bounded?

11.13 Let $X$ and $Y$ be Hilbert spaces and let $A : X \to Y$ be a bounded linear
operator. Show that there exists a uniquely determined bounded linear operator
$A^* : Y \to X$ such that

\[ (Au, v) = (u, A^* v) \]

for all $u \in X$ and $v \in Y$. The operator $A^*$ is called the adjoint operator of $A$.
Show that $\|A\| = \|A^*\|$.

11.14 Let $A : X \to X$ be a bounded, self-adjoint, and positive operator in
a Hilbert space $X$; i.e., $(Au, v) = (u, Av)$ for all $u, v \in X$ and $(Au, u) > 0$
for all $u \neq 0$. Choose $w_0 \in X$ and define $w_j = A w_{j-1}$ for $j = 1, \ldots, n-1$.
Show that the Galerkin equations for $Au = f$ with respect to the subspaces
$X_n = \operatorname{span}\{w_0, \ldots, w_{n-1}\}$ are uniquely solvable for each $n \in \mathbb{N}$. Moreover, if $f$
is in the closure of $\operatorname{span}\{A^j w_0 : j = 0, 1, \ldots\}$, then the Galerkin approximation
$u_n$ converges to the solution of $Au = f$.

Show that in the special case $w_0 = f$ the approximations $u_n$ can be computed
iteratively by the formulae $u_0 = 0$, $p_0 = f$, and

\[
\begin{aligned}
u_{n+1} &= u_n - \alpha_n p_n, \\
p_n &= r_n + \beta_{n-1}\, p_{n-1}, \\
r_n &= r_{n-1} - \alpha_{n-1}\, A p_{n-1}, \\
\alpha_{n-1} &= (r_{n-1}, p_{n-1}) / (A p_{n-1}, p_{n-1}), \\
\beta_{n-1} &= -(r_n, A p_{n-1}) / (A p_{n-1}, p_{n-1}).
\end{aligned}
\]

Here $r_n$ is the residual $r_n = A u_n - f$. This is the conjugate gradient method of
Hestenes and Stiefel.
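
As an illustration only (it does not replace the proof asked for in the problem), the recursion of Problem 11.14 can be tried out in the finite-dimensional case. The sketch below assumes $X = \mathbb{R}^m$ with the Euclidean scalar product and a symmetric positive definite matrix applied through a user-supplied function; the names `conjugate_gradient` and `apply_A` are hypothetical.

```python
def conjugate_gradient(apply_A, f, steps):
    """Sketch of the recursion in Problem 11.14 for X = R^m (assumed),
    with u_0 = 0, p_0 = f and residual r_n = A u_n - f."""
    def dot(v, w):
        return sum(vi * wi for vi, wi in zip(v, w))

    u = [0.0] * len(f)
    p = list(f)                    # p_0 = f
    r = [-fi for fi in f]          # r_0 = A u_0 - f = -f
    for _ in range(steps):
        Ap = apply_A(p)
        denom = dot(Ap, p)
        if denom == 0.0:           # p = 0: nothing left to correct
            break
        alpha = dot(r, p) / denom
        u = [ui - alpha * pi for ui, pi in zip(u, p)]       # u_{n+1}
        r = [ri - alpha * api for ri, api in zip(r, Ap)]    # r_{n+1}
        beta = -dot(r, Ap) / denom
        p = [ri + beta * pi for ri, pi in zip(r, p)]        # p_{n+1}
    return u
```

For a symmetric positive definite $2 \times 2$ matrix, two steps already reproduce the solution of $Au = f$ up to rounding, in line with the fact that $X_2$ is then the whole space.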

11.15 Let $A : X \to X$ be a bounded, self-adjoint, and positive operator in a
Hilbert space $X$; i.e., $(Au, v) = (u, Av)$ for all $u, v \in X$ and $(Au, u) > 0$ for all
$u \neq 0$. Show that solving the equation $Au = f$ is equivalent to minimizing the
so-called energy functional

\[ E(v) := (v, Av) - 2 \operatorname{Re}(v, f) \]

on $X$. Show that the Galerkin approximation with respect to a subspace $X_n$ is
equivalent to minimizing $E$ on $X_n$. This method is known as the Rayleigh-Ritz
method.

11.16 Show that under the assumptions of Problem 11.15 for the Galerkin
equations the SOR method of Section 4.2 converges for $0 < \omega < 2$.

11.17 Show that the weak derivative, if it exists, is unique and that each func- 
tion with vanishing weak derivative must be constant almost everywhere. 
11.18 Write a computer program for the finite element method with linear
splines and test it for various examples. Compare the numerical results with
those for the finite difference method.

11.19 Let $B_{-1}, B_0, B_1, \ldots, B_n, B_{n+1}, B_{n+2}$ denote the cubic B-splines for the
equidistant grid $x_j := a + jh$, $j = 0, \ldots, n+1$, with step size $h = (b - a)/n$. Show
that

\[
\begin{aligned}
u_0 &:= B_0 - 4B_{-1}, \quad u_1 := B_1 - B_{-1}, \\
u_2 &:= B_2, \ \ldots, \ u_{n-1} := B_{n-1}, \\
u_n &:= B_n - B_{n+2}, \quad u_{n+1} := B_{n+1} - 4B_{n+2}
\end{aligned}
\]

is a basis for $S_3 \cap H_0^1[a,b]$, i.e., for the space of cubic splines vanishing at the
endpoints.

Using this basis, set up the Galerkin equations for the Sturm-Liouville problem 
analogous to the case of linear splines treated in Section 11.5. 

11.20 Formulate and prove analogues of Theorems 11.27 and 11.28 for the finite 
element approximation using cubic splines as in Problem 11.19. 




12 

Integral Equations 



The topic of the last chapter of this book is linear integral equations, of
which

\[ \int_a^b K(x,y)\varphi(y)\, dy = f(x), \quad x \in [a,b], \]

and

\[ \varphi(x) - \int_a^b K(x,y)\varphi(y)\, dy = f(x), \quad x \in [a,b], \]

are typical examples. In these equations the function $\varphi$ is the unknown,
and the so-called kernel $K$ and the right-hand side $f$ are given functions.
The above equations are called Fredholm integral equations of the first and
second kind, respectively. Since both the theory and the numerical approximations
for integral equations of the first kind are far more complicated
than for integral equations of the second kind, we will confine our presentation
to the latter case.

Integral equations provide an important tool for solving boundary value
problems for both ordinary and partial differential equations (see Problem
12.1 and [39]). Their historical development is closely related to the solution
of boundary value problems in potential theory in the last decades of the
nineteenth century. Progress in the theory of integral equations also had a
great impact on the development of functional analysis.

Omitting the proofs, we will present the main results of the Riesz theory
for compact operators as the foundation of the existence theory for integral
equations of the second kind. Then we will develop the fundamental ideas
of the Nyström method and the collocation method as the two most important
approaches for the numerical solution of these integral equations.
This is done in a general framework of operator equations and their approximate
solution, which makes the analysis more widely applicable. For
a comprehensive study of both the theory and the numerical solution of
linear integral equations we refer to [39].



12.1 The Riesz Theory 



This section is devoted to a summary of some of the basic facts of the theory 
of Fredholm integral equations of the second kind. The integral equations 
formulated above carry the name of Fredholm, since in 1902 Fredholm 
established an existence theory for integral equations of the second kind 
with continuous kernels, which is now known as the Fredholm alternative. 
For the purpose of this introduction to the numerical solution of integral
equations it suffices to consider only the first and most important part of
this alternative, which states that the inhomogeneous equation

\[ \varphi(x) - \int_a^b K(x,y)\varphi(y)\, dy = f(x), \quad x \in [a,b], \tag{12.1} \]

with continuous kernel $K$ has a unique solution $\varphi \in C[a,b]$ for each right-hand
side $f \in C[a,b]$ if and only if the homogeneous integral equation

\[ \varphi(x) - \int_a^b K(x,y)\varphi(y)\, dy = 0, \quad x \in [a,b], \tag{12.2} \]

has only the trivial solution. The importance of this result originates from
the fact that it reduces the difficult problem of establishing existence of
a solution to the inhomogeneous integral equation to the simpler problem
of showing that the homogeneous integral equation allows only the trivial
solution $\varphi = 0$, and it extends the corresponding statement for systems
of linear equations to the case of integral equations. Actually, Fredholm
derived his results by interpreting integral equations as a limiting case of
linear systems by considering the integral as a limit of Riemann sums and
passing to the limit in Cramer's rule for the solution of linear systems. For
the solution of integral equations with continuous kernels, Fredholm's approach
is still the most elegant and shortest. However, since it is restricted
to the case of continuous kernels, it is more convenient to consider the
above equations as a special case of operator equations of the second kind
with a compact operator, as presented by Riesz in 1918.



Definition 12.1 A linear operator $A : X \to Y$ from a normed space $X$
into a normed space $Y$ is called compact if for each bounded sequence $(\varphi_n)$
in $X$ the sequence $(A\varphi_n)$ contains a convergent subsequence in $Y$, i.e., if
each sequence from the set $\{A\varphi : \varphi \in X, \|\varphi\| \leq 1\}$ contains a convergent
subsequence.

Without developing the concept of compactness in normed spaces in any
detail, we note that this definition is equivalent to requiring that the set
$\{A\varphi : \varphi \in X, \|\varphi\| \leq 1\}$ be relatively sequentially compact.

Compact operators are bounded, linear combinations of compact operators
are compact, and products of two bounded operators are compact if
one of them is compact (see Problem 12.2). From the Bolzano-Weierstrass
theorem it can be seen that bounded operators $A : X \to X$ with finite-dimensional
range $A(X) := \{A\varphi : \varphi \in X\}$ are compact. Furthermore, the
identity operator $I : X \to X$, defined by $I : \varphi \mapsto \varphi$ for all $\varphi \in X$, is compact
if and only if the space $X$ is finite-dimensional. This actually justifies
the distinction between the equations $A\varphi = f$ and $\varphi - A\varphi = f$ as equations
of the first and second kind, since $A$ and $I - A$ have different properties in
infinite-dimensional spaces if $A$ is compact. A proof of these facts and of
the following important theorem can be found in most introductory books
on functional analysis, for example in [39].

The fundamental result of the Riesz theory is described by the following 
theorem, which extends Fredholm’s result on the equivalence of injectivity 
and surjectivity to the case of operator equations of the second kind with 
a compact operator. 

Theorem 12.2 Let $A : X \to X$ be a compact operator in a normed space
$X$. Then $I - A$ is surjective if and only if it is injective. If the inverse
operator $(I - A)^{-1} : X \to X$ exists, it is bounded.

In order to verify that Fredholm's existence analysis for integral equations
with continuous kernels $K : [a,b] \times [a,b] \to \mathbb{R}$ can be viewed as a special
case of Theorem 12.2, we have to establish that the linear integral operator
$A : C[a,b] \to C[a,b]$, defined by

\[ (A\varphi)(x) := \int_a^b K(x,y)\varphi(y)\, dy, \quad x \in [a,b], \tag{12.3} \]

is compact. For this we need the following theorem due to Arzelà-Ascoli,
which again is proven in most introductions to functional analysis.

Theorem 12.3 (Arzelà-Ascoli) Each sequence from a subset $U \subset C[a,b]$
contains a uniformly convergent subsequence; i.e., $U$ is relatively sequentially
compact, if and only if it is bounded and equicontinuous, i.e., if there
exists a constant $C$ such that

\[ |\varphi(x)| \leq C \]

for all $x \in [a,b]$ and all $\varphi \in U$, and for every $\varepsilon > 0$ there exists $\delta > 0$ such
that

\[ |\varphi(x) - \varphi(y)| < \varepsilon \]

for all $x, y \in [a,b]$ with $|x - y| < \delta$ and all $\varphi \in U$.
Theorem 12.4 The integral operator (12.3) with continuous kernel is a
compact operator on $C[a,b]$.

Proof. For all $\varphi \in C[a,b]$ with $\|\varphi\|_\infty \leq 1$ and all $x \in [a,b]$, we have that

\[ |(A\varphi)(x)| \leq (b - a) \max_{x,y \in [a,b]} |K(x,y)|; \]

i.e., the set $U := \{A\varphi : \varphi \in C[a,b], \|\varphi\|_\infty \leq 1\} \subset C[a,b]$ is bounded. Since
$K$ is uniformly continuous on the square $[a,b] \times [a,b]$, for every $\varepsilon > 0$ there
exists $\delta > 0$ such that

\[ |K(x,z) - K(y,z)| < \frac{\varepsilon}{b - a} \]

for all $x, y, z \in [a,b]$ with $|x - y| < \delta$. Then

\[ |(A\varphi)(x) - (A\varphi)(y)| = \left| \int_a^b [K(x,z) - K(y,z)]\, \varphi(z)\, dz \right| < \varepsilon \]

for all $x, y \in [a,b]$ with $|x - y| < \delta$ and all $\varphi \in C[a,b]$ with $\|\varphi\|_\infty \leq 1$; i.e.,
$U$ is equicontinuous. Hence $A$ is compact by the Arzelà-Ascoli Theorem
12.3. □



In our analysis we also will need an explicit expression for the norm of 
the integral operator A. 

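Theorem 12.5 The norm of the integral operator $A : C[a,b] \to C[a,b]$
with continuous kernel $K$ is given by

\[ \|A\|_\infty = \max_{a \leq x \leq b} \int_a^b |K(x,y)|\, dy. \tag{12.4} \]

Proof. For each $\varphi \in C[a,b]$ with $\|\varphi\|_\infty \leq 1$ we have

\[ |(A\varphi)(x)| \leq \int_a^b |K(x,y)|\, dy, \quad x \in [a,b], \]

and thus

\[ \|A\|_\infty = \sup_{\|\varphi\|_\infty \leq 1} \|A\varphi\|_\infty \leq \max_{a \leq x \leq b} \int_a^b |K(x,y)|\, dy. \]

Since $K$ is continuous, there exists $x_0 \in [a,b]$ such that

\[ \int_a^b |K(x_0,y)|\, dy = \max_{a \leq x \leq b} \int_a^b |K(x,y)|\, dy. \]

For $\varepsilon > 0$ choose $\psi \in C[a,b]$ by setting

\[ \psi(y) := \frac{K(x_0,y)}{|K(x_0,y)| + \varepsilon}, \quad y \in [a,b]. \]

Then $\|\psi\|_\infty \leq 1$ and

\[
\|A\psi\|_\infty \geq |(A\psi)(x_0)|
= \int_a^b \frac{[K(x_0,y)]^2}{|K(x_0,y)| + \varepsilon}\, dy
\geq \int_a^b \frac{[K(x_0,y)]^2 - \varepsilon^2}{|K(x_0,y)| + \varepsilon}\, dy
= \int_a^b |K(x_0,y)|\, dy - \varepsilon (b - a).
\]

Hence

\[
\|A\|_\infty = \sup_{\|\varphi\|_\infty \leq 1} \|A\varphi\|_\infty \geq \|A\psi\|_\infty
\geq \int_a^b |K(x_0,y)|\, dy - \varepsilon (b - a),
\]

and since this holds for all $\varepsilon > 0$, we have

\[ \|A\|_\infty \geq \int_a^b |K(x_0,y)|\, dy = \max_{a \leq x \leq b} \int_a^b |K(x,y)|\, dy. \]

This concludes the proof. □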



It also can be shown that the integral operator remains compact if the
kernel $K$ is merely weakly singular (see [39]). A kernel $K$ is said to be
weakly singular if it is defined and continuous for all $x, y \in [a,b]$, $x \neq y$,
and there exist positive constants $M$ and $\alpha \in (0,1]$ such that

\[ |K(x,y)| \leq M\, |x - y|^{\alpha - 1} \]

for all $x, y \in [a,b]$, $x \neq y$.



12.2 Operator Approximations 

The fundamental concept for approximately solving an operator equation

\[ \varphi - A\varphi = f \]

of the second kind is to replace it by an equation

\[ \varphi_n - A_n \varphi_n = f_n \]

with approximating sequences $A_n \to A$ and $f_n \to f$ as $n \to \infty$. For computational
purposes, the approximating equations will be chosen such that
they can be reduced to solving a system of linear equations. In this section
we will provide a convergence and error analysis for such approximation
schemes. In particular, we will derive convergence results and error estimates
for the cases where we have either norm or pointwise convergence of
the sequence $A_n \to A$, $n \to \infty$.

Theorem 12.6 Let $A : X \to X$ be a compact linear operator on a Banach
space $X$ such that $I - A$ is injective. Assume that the sequence $A_n : X \to X$
of bounded linear operators is norm convergent, i.e., $\|A_n - A\| \to 0$, $n \to \infty$.
Then for sufficiently large $n$ the inverse operators $(I - A_n)^{-1} : X \to X$
exist and are uniformly bounded. For the solutions of the equations

\[ \varphi - A\varphi = f \quad \text{and} \quad \varphi_n - A_n \varphi_n = f_n \]

we have an error estimate

\[ \|\varphi_n - \varphi\| \leq C \{ \|(A_n - A)\varphi\| + \|f_n - f\| \} \tag{12.5} \]

for some constant $C$.

Proof. By the Riesz Theorem 12.2, the inverse $(I - A)^{-1} : X \to X$ exists
and is bounded. Since $\|A_n - A\| \to 0$, $n \to \infty$, by Remark 3.25 we have
$\|(I - A)^{-1}(A_n - A)\| \leq q < 1$ for sufficiently large $n$. For these $n$, by the
Neumann series Theorem 3.48, the inverse operators of

\[ I - (I - A)^{-1}(A_n - A) = (I - A)^{-1}(I - A_n) \]

exist and are uniformly bounded by

\[ \| [I - (I - A)^{-1}(A_n - A)]^{-1} \| \leq \frac{1}{1 - q}. \]

But then $[I - (I - A)^{-1}(A_n - A)]^{-1}(I - A)^{-1}$ are the inverse operators of
$I - A_n$ and they are uniformly bounded.

The error estimate follows from

\[ (I - A_n)(\varphi_n - \varphi) = (A_n - A)\varphi + f_n - f \]

by the uniform boundedness of the inverse operators $(I - A_n)^{-1}$. □
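
The identity used for the error estimate is obtained by subtracting the two equations; a short derivation, written here with the constant $C$ taken as a uniform bound for $\|(I - A_n)^{-1}\|$ (our choice of notation for the constant in (12.5)), may be helpful:

```latex
\begin{align*}
(I - A_n)\varphi_n &= f_n, \\
(I - A_n)\varphi &= (I - A)\varphi + (A - A_n)\varphi = f - (A_n - A)\varphi, \\
(I - A_n)(\varphi_n - \varphi) &= (A_n - A)\varphi + f_n - f, \\
\|\varphi_n - \varphi\| &\le \|(I - A_n)^{-1}\| \,
   \bigl\{ \|(A_n - A)\varphi\| + \|f_n - f\| \bigr\}
   \le C \bigl\{ \|(A_n - A)\varphi\| + \|f_n - f\| \bigr\}.
\end{align*}
```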

In order to develop a similar analysis for the case where the sequence
$(A_n)$ is merely pointwise convergent, i.e., $A_n \varphi \to A\varphi$, $n \to \infty$, for all $\varphi$, we
will have to bridge the gap between norm and pointwise convergence. This
goal will be achieved through the concept of collectively compact operator
sequences and the following uniform boundedness principle.

Theorem 12.7 Let the sequence $A_n : X \to Y$ of bounded linear operators
mapping a Banach space $X$ into a normed space $Y$ be pointwise bounded;
i.e., for each $\varphi \in X$ there exists a positive number $C_\varphi$ depending on $\varphi$
such that $\|A_n \varphi\| \leq C_\varphi$ for all $n \in \mathbb{N}$. Then the sequence $(A_n)$ is uniformly
bounded; i.e., there exists some constant $C$ such that $\|A_n\| \leq C$ for all
$n \in \mathbb{N}$.

Proof. In the first step, by an indirect proof we establish that positive
constants $M$ and $\rho$ and an element $\psi \in X$ can be chosen such that

\[ \|A_n \varphi\| \leq M \tag{12.6} \]

for all $\varphi \in X$ with $\|\varphi - \psi\| \leq \rho$ and all $n \in \mathbb{N}$. Assume that this is not
possible. Then, by induction, we construct sequences $(n_k)$ in $\mathbb{N}$, $(\rho_k)$ in $\mathbb{R}$,
and $(\varphi_k)$ in $X$ such that

\[ \|A_{n_k} \varphi\| \geq k \]

for $k = 0, 1, 2, \ldots$ and $\varphi$ with $\|\varphi - \varphi_k\| \leq \rho_k$ and

\[ 0 < \rho_k \leq \tfrac{1}{2}\, \rho_{k-1}, \qquad \|\varphi_k - \varphi_{k-1}\| \leq \tfrac{1}{2}\, \rho_{k-1} \]

for $k = 1, 2, \ldots$.

We initiate the induction by setting $n_0 = 1$, $\rho_0 = 1$, and $\varphi_0 = 0$. Assume
that $n_k \in \mathbb{N}$, $\rho_k > 0$, and $\varphi_k \in X$ are given. Then there exist $n_{k+1} \in \mathbb{N}$
and $\varphi_{k+1} \in X$ satisfying $\|\varphi_{k+1} - \varphi_k\| \leq \rho_k/2$ and $\|A_{n_{k+1}} \varphi_{k+1}\| \geq k + 2$.
Otherwise, we would have $\|A_n \varphi\| \leq k + 2$ for all $\varphi \in X$ with $\|\varphi - \varphi_k\| \leq \rho_k/2$
and all $n \in \mathbb{N}$, and this contradicts our assumption. Set

\[ \rho_{k+1} := \min \left\{ \frac{\rho_k}{2},\ \frac{1}{\|A_{n_{k+1}}\|} \right\}. \]

Then for all $\varphi \in X$ with $\|\varphi - \varphi_{k+1}\| \leq \rho_{k+1}$, by the triangle inequality we
have

\[ \|A_{n_{k+1}} \varphi\| \geq \|A_{n_{k+1}} \varphi_{k+1}\| - \|A_{n_{k+1}} (\varphi - \varphi_{k+1})\| \geq k + 1, \]

since $\|A_{n_{k+1}} (\varphi - \varphi_{k+1})\| \leq \|A_{n_{k+1}}\|\, \rho_{k+1} \leq 1$.

For $j > k$, using the geometric series we have

\[
\|\varphi_k - \varphi_j\| \leq \|\varphi_k - \varphi_{k+1}\| + \cdots + \|\varphi_{j-1} - \varphi_j\|
\leq \tfrac{1}{2}\, \rho_k + \cdots + \tfrac{1}{2}\, \rho_{j-1} \leq \rho_k.
\]

Therefore, $(\varphi_k)$ is a Cauchy sequence and converges to some element $\varphi$ in
the Banach space $X$. From $\|\varphi_k - \varphi_j\| \leq \rho_k$ for all $j > k$ by passing to the
limit $j \to \infty$ we see that $\|\varphi_k - \varphi\| \leq \rho_k$ for all $k \in \mathbb{N}$. Therefore, we have
$\|A_{n_k} \varphi\| \geq k$ for all $k \in \mathbb{N}$, which is a contradiction to the boundedness of
the sequence $(A_n \varphi)$.

Now, in the second step, from the validity of (12.6) we deduce for each
$\varphi \in X$ with $\|\varphi\| \leq 1$ and for all $n \in \mathbb{N}$ the estimate

\[ \|A_n \varphi\| = \frac{1}{\rho}\, \|A_n(\rho\varphi + \psi) - A_n \psi\| \leq \frac{2M}{\rho}. \]

This completes the proof. □

Following Anselone [2], we introduce the concept of collectively compact 
operator sequences. 

Definition 12.8 A sequence $A_n : X \to Y$ of linear operators from a
normed space $X$ into a normed space $Y$ is called collectively compact if
each sequence from the set $\{A_n \varphi : \varphi \in X, \|\varphi\| \leq 1, n \in \mathbb{N}\}$ contains a
convergent subsequence.

Each operator $A_n$ from a collectively compact sequence is compact.

Lemma 12.9 Let $X$ be a Banach space, let $A_n : X \to X$ be a collectively
compact sequence, and let $B_n : X \to X$ be a pointwise convergent sequence
with limit operator $B : X \to X$. Then

\[ \|(B_n - B) A_n\| \to 0, \quad n \to \infty. \tag{12.7} \]

Proof. Assume that (12.7) is not valid. Then there exist $\varepsilon_0 > 0$, a sequence
$(n_k)$ in $\mathbb{N}$ with $n_k \to \infty$, $k \to \infty$, and a sequence $(\varphi_k)$ in $X$ with $\|\varphi_k\| \leq 1$
such that

\[ \|(B_{n_k} - B) A_{n_k} \varphi_k\| \geq \varepsilon_0, \quad k = 1, 2, \ldots. \tag{12.8} \]

Since the sequence $(A_n)$ is collectively compact, there exists a subsequence
such that

\[ A_{n_{k(j)}} \varphi_{k(j)} \to \psi \in X, \quad j \to \infty. \tag{12.9} \]

Then we can estimate with the aid of the triangle inequality and Remark
3.25 to obtain

\[
\|(B_{n_{k(j)}} - B) A_{n_{k(j)}} \varphi_{k(j)}\|
\leq \|(B_{n_{k(j)}} - B)\psi\| + \|B_{n_{k(j)}} - B\|\, \|A_{n_{k(j)}} \varphi_{k(j)} - \psi\|. \tag{12.10}
\]

The first term on the right-hand side of (12.10) tends to zero as $j \to \infty$,
since the operator sequence $(B_n)$ is pointwise convergent. The second term
tends to zero as $j \to \infty$, since the operator sequence $(B_n)$ is uniformly
bounded by Theorem 12.7 and since we have the convergence (12.9). Therefore,
passing to the limit $j \to \infty$ in (12.10) yields a contradiction to (12.8),
and the proof is complete. □
Theorem 12.10 Let $A : X \to X$ be a compact linear operator on a Banach
space $X$ such that $I - A$ is injective, and assume that the sequence
$A_n : X \to X$ of linear operators is collectively compact and pointwise convergent;
i.e., $A_n \varphi \to A\varphi$, $n \to \infty$, for all $\varphi \in X$. Then for sufficiently
large $n$ the inverse operators $(I - A_n)^{-1} : X \to X$ exist and are uniformly
bounded. For the solutions of the equations

\[ \varphi - A\varphi = f \quad \text{and} \quad \varphi_n - A_n \varphi_n = f_n \]

we have an error estimate

\[ \|\varphi_n - \varphi\| \leq C \{ \|(A_n - A)\varphi\| + \|f_n - f\| \} \tag{12.11} \]

for some constant $C$.

Proof. By the Riesz Theorem 12.2, the inverse $(I - A)^{-1} : X \to X$ exists
and is bounded. The identity

\[ (I - A)^{-1} = I + (I - A)^{-1} A \]

suggests

\[ M_n := I + (I - A)^{-1} A_n \]

as an approximate inverse for $I - A_n$. Elementary calculations yield

\[ M_n (I - A_n) = I - S_n, \tag{12.12} \]

where

\[ S_n := (I - A)^{-1} (A_n - A) A_n. \]
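
The elementary calculation behind (12.12) can be spelled out as follows; it uses only the definition of $M_n$ and the fact that $(I - A)^{-1}(I - A) = I$:

```latex
\begin{align*}
M_n (I - A_n)
  &= I - A_n + (I - A)^{-1} A_n (I - A_n) \\
  &= I - (I - A)^{-1} \bigl[ (I - A) A_n - A_n + A_n^2 \bigr] \\
  &= I - (I - A)^{-1} \bigl[ A_n^2 - A A_n \bigr] \\
  &= I - (I - A)^{-1} (A_n - A) A_n = I - S_n .
\end{align*}
```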

From Lemma 12.9 we conclude that $\|S_n\| \to 0$, $n \to \infty$. Hence for sufficiently
large $n$ we have $\|S_n\| \leq q < 1$. For these $n$, by the Neumann series
Theorem 3.48, the inverse operators $(I - S_n)^{-1}$ exist and are uniformly
bounded by

\[ \|(I - S_n)^{-1}\| \leq \frac{1}{1 - q}. \]

Now (12.12) implies first that $I - A_n$ is injective, and therefore, since $A_n$ is
compact, by Theorem 12.2 the inverse $(I - A_n)^{-1}$ exists. Then (12.12) also
yields $(I - A_n)^{-1} = (I - S_n)^{-1} M_n$, whence uniform boundedness follows,
since the operators $M_n$ are uniformly bounded by Theorem 12.7. The error
estimate (12.11) is proven as in Theorem 12.6. □

Note that both error estimates (12.5) and (12.11) show that the accuracy
of the approximate solution essentially depends on how well $A_n \varphi$ approximates
$A\varphi$ for the exact solution $\varphi$.
12.3 Nyström's Method



Recalling Chapter 9, we choose a convergent sequence

\[ Q_n(g) := \sum_{k=0}^{n} a_k^{(n)}\, g(x_k^{(n)}) \]

of quadrature formulae for the integral

\[ Q(g) := \int_a^b g(x)\, dx \]

with quadrature points $x_0^{(n)}, \ldots, x_n^{(n)} \in [a,b]$ and real quadrature weights
$a_0^{(n)}, \ldots, a_n^{(n)}$. For convenience we write $x_0, \ldots, x_n$ instead of $x_0^{(n)}, \ldots, x_n^{(n)}$,
and $a_0, \ldots, a_n$ instead of $a_0^{(n)}, \ldots, a_n^{(n)}$. We approximate the integral operator

\[ (A\varphi)(x) = \int_a^b K(x,y)\varphi(y)\, dy, \quad x \in [a,b], \]

with continuous kernel $K$ by a sequence of numerical integration operators

\[ (A_n \varphi)(x) := \sum_{k=0}^{n} a_k K(x, x_k)\, \varphi(x_k), \quad x \in [a,b]; \]

i.e., we apply the quadrature formulae for $g = K(x, \cdot)\varphi$. Then the solution
to the integral equation of the second kind

\[ \varphi - A\varphi = f \]

is approximated by the solution of

\[ \varphi_n - A_n \varphi_n = f, \]

which reduces to solving a finite-dimensional linear system.

Theorem 12.11 Let $\varphi_n$ be a solution of

\[ \varphi_n(x) - \sum_{k=0}^{n} a_k K(x, x_k)\, \varphi_n(x_k) = f(x), \quad x \in [a,b]. \tag{12.13} \]

Then the values $\varphi_j^{(n)} := \varphi_n(x_j)$, $j = 0, \ldots, n$, at the quadrature points
satisfy the linear system

\[ \varphi_j^{(n)} - \sum_{k=0}^{n} a_k K(x_j, x_k)\, \varphi_k^{(n)} = f(x_j), \quad j = 0, \ldots, n. \tag{12.14} \]

Conversely, let $\varphi_j^{(n)}$, $j = 0, \ldots, n$, be a solution of the system (12.14). Then
the function $\varphi_n$ defined by

\[ \varphi_n(x) := f(x) + \sum_{k=0}^{n} a_k K(x, x_k)\, \varphi_k^{(n)}, \quad x \in [a,b], \tag{12.15} \]

solves equation (12.13).

Proof. The first statement is trivial. For a solution $\varphi_j^{(n)}$, $j = 0, \ldots, n$, of
the system (12.14) the function $\varphi_n$ defined by (12.15) has values

\[ \varphi_n(x_j) = f(x_j) + \sum_{k=0}^{n} a_k K(x_j, x_k)\, \varphi_k^{(n)} = \varphi_j^{(n)}, \quad j = 0, \ldots, n. \]

Inserting this into (12.15), we see that $\varphi_n$ satisfies (12.13). □

The formula (12.15) may be viewed as a natural interpolation of the values
$\varphi_j^{(n)}$, $j = 0, \ldots, n$, at the quadrature points to obtain the approximating
function $\varphi_n$. It was introduced by Nyström in 1930.
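
The two steps of Theorem 12.11, solving the linear system (12.14) and then interpolating by (12.15), can be sketched as follows. This is a minimal illustration under our own assumptions, not a definitive implementation: the helper names `trapezoidal_rule`, `solve_dense`, and `nystrom` are hypothetical, the composite trapezoidal rule is just one admissible choice of quadrature, and the linear system is solved by straightforward Gaussian elimination with partial pivoting.

```python
def trapezoidal_rule(a, b, n):
    """Nodes and weights of the composite trapezoidal rule on [a, b]."""
    h = (b - a) / n
    nodes = [a + k * h for k in range(n + 1)]
    weights = [h] * (n + 1)
    weights[0] = weights[-1] = h / 2
    return nodes, weights


def solve_dense(M, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    m = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for c in range(m):
        piv = max(range(c, m), key=lambda i: abs(aug[i][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for i in range(c + 1, m):
            factor = aug[i][c] / aug[c][c]
            for j in range(c, m + 1):
                aug[i][j] -= factor * aug[c][j]
    sol = [0.0] * m
    for i in range(m - 1, -1, -1):
        s = sum(aug[i][j] * sol[j] for j in range(i + 1, m))
        sol[i] = (aug[i][m] - s) / aug[i][i]
    return sol


def nystrom(kernel, f, nodes, weights):
    """Solve (12.14) for the nodal values and return them together with
    the Nystrom interpolant (12.15)."""
    m = len(nodes)
    # Matrix of the system (12.14): delta_jk - a_k K(x_j, x_k)
    M = [[(1.0 if j == k else 0.0) - weights[k] * kernel(nodes[j], nodes[k])
          for k in range(m)] for j in range(m)]
    values = solve_dense(M, [f(x) for x in nodes])

    def phi_n(x):
        # Interpolation formula (12.15)
        return f(x) + sum(w * kernel(x, y) * v
                          for w, y, v in zip(weights, nodes, values))

    return values, phi_n
```

As a quick check with a made-up example, the kernel $K(x,y) = x/2$ on $[0,1]$ with right-hand side $f(x) = 1 - x/2$ has the exact solution $\varphi = 1$; since the trapezoidal rule integrates constants exactly, `nystrom` reproduces this solution up to rounding both at the nodes and through the interpolant.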

For convenience we note the following analogue of Theorem 12.5. 

Theorem 12.12 The norm of the quadrature operators $A_n$ is given by

\[ \|A_n\|_\infty = \max_{a \leq x \leq b} \sum_{k=0}^{n} |a_k K(x, x_k)|. \tag{12.16} \]

Proof. For each $\varphi \in C[a,b]$ with $\|\varphi\|_\infty \leq 1$ we have

\[ \|A_n \varphi\|_\infty \leq \max_{a \leq x \leq b} \sum_{k=0}^{n} |a_k K(x, x_k)|, \]

and therefore $\|A_n\|_\infty$ is smaller than or equal to the right-hand side of
(12.16). Let $z \in [a,b]$ be such that

\[ \sum_{k=0}^{n} |a_k K(z, x_k)| = \max_{a \leq x \leq b} \sum_{k=0}^{n} |a_k K(x, x_k)| \]

and choose $\psi \in C[a,b]$ with $\|\psi\|_\infty = 1$ and

\[ a_k K(z, x_k)\, \psi(x_k) = |a_k K(z, x_k)|, \quad k = 0, \ldots, n. \]

Then

\[ \|A_n\|_\infty \geq \|A_n \psi\|_\infty \geq |(A_n \psi)(z)| = \sum_{k=0}^{n} |a_k K(z, x_k)|, \]

and (12.16) is proven. □



The error analysis will be based on the following theorem. 
Theorem 12.13 Assume the quadrature formulae $(Q_n)$ to be convergent.
Then the sequence $(A_n)$ is collectively compact and pointwise convergent
(i.e., $A_n \varphi \to A\varphi$, $n \to \infty$, for all $\varphi \in C[a,b]$) but not norm convergent.

Proof. Since the quadrature formulae $(Q_n)$ are assumed to be convergent,
by (9.13) and the uniform boundedness principle Theorem 12.7 there exists
a constant $C$ such that the weights satisfy

\[ \sum_{k=0}^{n} |a_k^{(n)}| \leq C \]

for all $n \in \mathbb{N}$ (see Theorem 9.10). Then we can estimate

\[ \|A_n \varphi\|_\infty \leq C \max_{a \leq x, y \leq b} |K(x,y)|\, \|\varphi\|_\infty \tag{12.17} \]

and

\[ |(A_n \varphi)(x_1) - (A_n \varphi)(x_2)| \leq C \max_{a \leq y \leq b} |K(x_1,y) - K(x_2,y)|\, \|\varphi\|_\infty \tag{12.18} \]

for all $x_1, x_2 \in [a,b]$. From (12.17) and (12.18) we see that

\[ \{A_n \varphi : \varphi \in C[a,b],\ \|\varphi\|_\infty \leq 1,\ n \in \mathbb{N}\} \]

is bounded and equicontinuous because the kernel $K$ is uniformly continuous
on $[a,b] \times [a,b]$. Therefore, by the Arzelà-Ascoli Theorem 12.3 the
sequence $(A_n)$ is collectively compact.

Since the quadrature is convergent, for fixed $\varphi \in C[a,b]$ the sequence
$(A_n \varphi)$ is pointwise convergent; i.e., $(A_n \varphi)(x) \to (A\varphi)(x)$, $n \to \infty$, for all
$x \in [a,b]$. As a consequence of (12.18), the sequence $(A_n \varphi)$ is equicontinuous.
Hence it is uniformly convergent: $\|A_n \varphi - A\varphi\|_\infty \to 0$, $n \to \infty$. That
is, we have pointwise convergence: $A_n \varphi \to A\varphi$, $n \to \infty$, for all $\varphi \in C[a,b]$
(see Problem 12.7).

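For $\varepsilon > 0$ choose a function $\psi_\varepsilon \in C[a,b]$ with $0 \leq \psi_\varepsilon(x) \leq 1$ for all
$x \in [a,b]$ such that $\psi_\varepsilon(x) = 1$ if $\min_{j=0,\ldots,n} |x - x_j| \geq \varepsilon$ and $\psi_\varepsilon(x_j) = 0$,
$j = 0, \ldots, n$. Then

\[
\|A(\varphi \psi_\varepsilon) - A\varphi\|_\infty \leq \max_{x,y \in [a,b]} |K(x,y)| \int_a^b \{1 - \psi_\varepsilon(y)\}\, dy \to 0, \quad \varepsilon \to 0,
\]

for all $\varphi \in C[a,b]$ with $\|\varphi\|_\infty = 1$. Using this result, we derive

\[
\begin{aligned}
\|A - A_n\|_\infty &= \sup_{\|\varphi\|_\infty = 1} \|(A - A_n)\varphi\|_\infty
\geq \sup_{\|\varphi\|_\infty = 1}\, \sup_{\varepsilon > 0} \|(A - A_n)(\varphi \psi_\varepsilon)\|_\infty \\
&= \sup_{\|\varphi\|_\infty = 1}\, \sup_{\varepsilon > 0} \|A(\varphi \psi_\varepsilon)\|_\infty
\geq \sup_{\|\varphi\|_\infty = 1} \|A\varphi\|_\infty = \|A\|_\infty,
\end{aligned}
\]

whence we see that the sequence $(A_n)$ cannot be norm convergent. □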
Theorem 12.13 enables us to apply the approximation theory of Theorem
12.10. For the discussion of the error based on the estimate (12.11) we need
the norm $\|A\varphi - A_n \varphi\|_\infty$. It can be expressed in terms of the error for the
corresponding numerical quadrature by

\[
\|A\varphi - A_n \varphi\|_\infty = \max_{a \leq x \leq b} \left| \int_a^b K(x,y)\varphi(y)\, dy - \sum_{k=0}^{n} a_k K(x, x_k)\, \varphi(x_k) \right|
\]

and requires a uniform estimate for the error of the quadrature applied
to the integration of $K(x, \cdot)\varphi$. Therefore, from the error estimate (12.11),
it follows that under suitable regularity assumptions on the kernel $K$ and
the exact solution $\varphi$, the convergence order of the underlying quadrature
formulae carries over to the convergence order of the approximate solutions
to the integral equation. We illustrate this by the case of the trapezoidal
rule. Under the assumption $\varphi \in C^2[a,b]$ and $K \in C^2([a,b] \times [a,b])$, by
Theorem 9.7, we can estimate

\[
\|A\varphi - A_n \varphi\|_\infty \leq \frac{1}{12}\, h^2 (b - a) \max_{a \leq x, y \leq b} \left| \frac{\partial^2}{\partial y^2} [K(x,y)\varphi(y)] \right|.
\]



Example 12.14 Consider the integral equation

\[
\varphi(x) - \frac{1}{2} \int_0^1 (x + 1)\, e^{-xy} \varphi(y)\, dy = e^{-x} - \frac{1}{2} + \frac{1}{2}\, e^{-(x+1)}, \quad 0 \leq x \leq 1, \tag{12.19}
\]

with exact solution $\varphi(x) = e^{-x}$. For its kernel we have

\[
\max_{0 \leq x \leq 1} \frac{1}{2} \int_0^1 (x + 1)\, e^{-xy}\, dy = \sup_{0 < x \leq 1} \frac{x + 1}{2x}\, (1 - e^{-x}) < 1.
\]

Therefore, by the Neumann series Theorem 3.48 and the operator norm
(12.4), equation (12.19) is uniquely solvable.
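
The bound on the operator norm can also be checked numerically. The following sketch evaluates the right-hand side of (12.4) for the kernel of (12.19) by brute force; the composite Simpson rule for the inner integral, the sampling of $x$ on a uniform grid, and the names `simpson`, `kernel`, and `operator_norm` are our own choices for illustration.

```python
import math


def simpson(g, a, b, m):
    """Composite Simpson rule with an even number m of subintervals."""
    h = (b - a) / m
    total = g(a) + g(b)
    for i in range(1, m):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3


def kernel(x, y):
    """Kernel of the integral equation (12.19), including the factor 1/2."""
    return 0.5 * (x + 1.0) * math.exp(-x * y)


def operator_norm(samples=201, m=200):
    """Approximate max over x in [0, 1] of the integral of |K(x, y)| dy."""
    xs = [i / (samples - 1) for i in range(samples)]
    return max(simpson(lambda y, x=x: abs(kernel(x, y)), 0.0, 1.0, m)
               for x in xs)
```

Since $(x+1)(1 - e^{-x})/(2x)$ is increasing on $(0,1]$, the maximum is attained at $x = 1$, so `operator_norm()` should return a value close to $1 - e^{-1} \approx 0.632$, comfortably below 1.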

We use the (composite) trapezoidal rule for approximately solving the
integral equation (12.19) by the Nyström method. Table 12.1 gives the
difference between the exact and approximate solutions and clearly shows
the expected convergence rate $O(h^2)$.

TABLE 12.1. Numerical solution of (12.19) by the trapezoidal rule

   n    x = 0      x = 0.25   x = 0.5    x = 0.75   x = 1
   4    0.007146   0.008878   0.010816   0.013007   0.015479
   8    0.001788   0.002224   0.002711   0.003261   0.003882
  16    0.000447   0.000556   0.000678   0.000816   0.000971
  32    0.000112   0.000139   0.000170   0.000204   0.000243

We now use the (composite) Simpson’s rule for the integral equation
(12.19). The numerical results in Table 12.2 show the convergence order
$O(h^4)$, which we expect from the error estimate (12.11) and the convergence
order for Simpson’s rule from Theorem 9.8. □



TABLE 12.2. Numerical solution of (12.19) by Simpson’s rule

   n    x = 0        x = 0.25     x = 0.5      x = 0.75     x = 1
   4    0.00006652   0.00008311   0.00010905   0.00015046   0.00021416
   8    0.00000422   0.00000527   0.00000692   0.00000956   0.00001366
  16    0.00000026   0.00000033   0.00000043   0.00000060   0.00000086



After comparing Tables 12.1 and 12.2, we wish to emphasize the major 
advantage of Nystrom’s method over other methods like the collocation 
method, which we will discuss in the next section. The matrix and the 
right-hand side of the linear system (12.14) are obtained by just evaluating 
the kernel K and the given function / at the quadrature points. Therefore, 
without any further computational effort we can improve considerably on 
the approximations by choosing a more accurate numerical quadrature for- 
mula. 
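
The remark above can be tried out with a minimal sketch of the Nyström method for equation (12.19), in which switching from the trapezoidal rule to Simpson's rule only changes the weight vector. The sketch assumes equidistant points $x_k = k/n$ on $[0,1]$; the helper names (`solve`, `quadrature_weights`, `nystrom_max_error`) are illustrative choices and not taken from the text, and the small dense solver merely stands in for any linear algebra routine.

```python
import math

def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting for small dense systems."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x

def kernel(x, y):
    # Kernel of (12.19), including the factor 1/2.
    return 0.5 * (x + 1.0) * math.exp(-x * y)

def right_hand_side(x):
    return math.exp(-x) - 0.5 + 0.5 * math.exp(-(x + 1.0))

def quadrature_weights(n, rule):
    """Weights of the composite rule on the n + 1 points k/n, k = 0..n."""
    h = 1.0 / n
    if rule == "trapezoidal":
        w = [h] * (n + 1)
        w[0] = w[n] = h / 2.0
    elif rule == "simpson":
        if n % 2:
            raise ValueError("Simpson's rule needs an even n")
        w = [(4.0 if k % 2 else 2.0) * h / 3.0 for k in range(n + 1)]
        w[0] = w[n] = h / 3.0
    else:
        raise ValueError("unknown rule: " + rule)
    return w

def nystrom_max_error(n, rule):
    """Largest nodal error of the Nystrom solution against exp(-x)."""
    x = [k / n for k in range(n + 1)]
    a = quadrature_weights(n, rule)
    # Matrix of the system: identity minus a_k K(x_j, x_k).
    system = [[(1.0 if j == k else 0.0) - a[k] * kernel(x[j], x[k])
               for k in range(n + 1)] for j in range(n + 1)]
    phi = solve(system, [right_hand_side(xj) for xj in x])
    return max(abs(p - math.exp(-xj)) for p, xj in zip(phi, x))

if __name__ == "__main__":
    for rule in ("trapezoidal", "simpson"):
        for n in (4, 8, 16):
            print(rule, n, nystrom_max_error(n, rule))
```

Halving the step size should reduce the error by a factor of about 4 for the trapezoidal rule and about 16 for Simpson's rule, in line with the orders $O(h^2)$ and $O(h^4)$; only `quadrature_weights` differs between the two runs.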

In the next example we consider an integral equation with a periodic 
kernel and a periodic solution. 



Example 12.15 Consider the integral equation

$$\varphi(t) + \frac{ab}{\pi}\int_0^{2\pi} \frac{\varphi(\tau)\,d\tau}{a^2 + b^2 - (a^2 - b^2)\cos(t + \tau)} = f(t), \quad 0 \le t \le 2\pi, \tag{12.20}$$


where a > b > 0. This integral equation arises from the solution of the
Dirichlet problem for the Laplace equation in an ellipse with semiaxes $a$
and $b$ (see [39]). Any solution $\varphi$ to the homogeneous form of equation
(12.20) clearly must be a $2\pi$-periodic analytic function, since the kernel is
a $2\pi$-periodic analytic function with respect to the variable $t$. Hence, we
can expand $\varphi$ into a uniformly convergent Fourier series

$$\varphi(t) = \sum_{n=0}^\infty \alpha_n \cos nt + \sum_{n=1}^\infty \beta_n \sin nt.$$

Inserting this into the homogeneous integral equation and using the
integrals (see Problem 12.10)

$$\frac{ab}{\pi}\int_0^{2\pi} \frac{e^{in\tau}\,d\tau}{(a^2 + b^2) - (a^2 - b^2)\cos(t + \tau)} = \left(\frac{a-b}{a+b}\right)^n e^{-int} \tag{12.21}$$

for $n = 0, 1, 2, \ldots$, it follows that

$$\alpha_n \left[ 1 + \left(\frac{a-b}{a+b}\right)^n \right] = \beta_n \left[ 1 - \left(\frac{a-b}{a+b}\right)^n \right] = 0$$

for $n = 0, 1, 2, \ldots$. Hence, $\alpha_n = \beta_n = 0$ for $n = 0, 1, 2, \ldots$, and therefore
$\varphi = 0$. Now the Riesz Theorem 12.2 implies that the integral equation
(12.20) is uniquely solvable for each right-hand side $f$.

We numerically want to solve (12.20) in the case where the unique
solution is given by

$$\varphi(t) = e^{\cos t}\cos(\sin t), \quad 0 \le t \le 2\pi.$$

Using the integrals (12.21), it can be seen that the right-hand side becomes

$$f(t) = \varphi(t) + e^{c\cos t}\cos(c\sin t), \quad 0 \le t \le 2\pi,$$

where $c = (a - b)/(a + b)$.

Since we are dealing with periodic analytic functions, we use the rectan- 
gular rule. From Theorem 9.28 we expect an exponentially decreasing error 
behavior, which is exhibited by the numerical results in Table 12.3 giving 
the difference between the exact and approximate solutions. Doubling the 
number of quadrature points doubles the number of correct digits in the 
approximate solution. 

TABLE 12.3. Nyström method for equation (12.20)

  a    b      n    t = 0          t = π/2        t = π
  1    0.5    4    -0.15350443     0.01354412    -0.00636277
              8    -0.00281745     0.00009601    -0.00004247
             16    -0.00000044     0.00000001    -0.00000001
  1    0.2    4    -0.69224130    -0.06117951    -0.06216587
              8    -0.15017166    -0.00971695    -0.01174302
             16    -0.00602633    -0.00036043    -0.00045498
             32    -0.00000919    -0.00000055    -0.00000069


The actual size of the error, i.e., the constant factor in the exponential
decay, depends on the parameters $a$ and $b$, which describe the location of
the singularities of the integrands in the complex plane; i.e., they determine
the width of the strip of the complex plane into which the kernel can be
extended as a holomorphic function.
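
As a hedged illustration of this exponential error decay, the following sketch applies the Nyström method with the rectangular rule to equation (12.20). It assumes $N$ equidistant points $t_j = 2\pi j/N$ with equal weights $2\pi/N$, which appears to match the column $n$ of Table 12.3; the function names are illustrative only.

```python
import math

def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting for small dense systems."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x

def ellipse_nystrom_max_error(num_points, a=1.0, b=0.5):
    """Largest nodal error of the rectangular-rule Nystrom solution of (12.20)."""
    c = (a - b) / (a + b)
    t = [2.0 * math.pi * j / num_points for j in range(num_points)]
    weight = 2.0 * math.pi / num_points

    def kernel(s, tau):
        return (a * b / math.pi) / (a * a + b * b - (a * a - b * b) * math.cos(s + tau))

    def exact(s):
        return math.exp(math.cos(s)) * math.cos(math.sin(s))

    def right_hand_side(s):
        return exact(s) + math.exp(c * math.cos(s)) * math.cos(c * math.sin(s))

    # The integral enters (12.20) with a plus sign, so the matrix is I + weight * K.
    system = [[(1.0 if j == k else 0.0) + weight * kernel(t[j], t[k])
               for k in range(num_points)] for j in range(num_points)]
    phi = solve(system, [right_hand_side(tj) for tj in t])
    return max(abs(p - exact(tj)) for p, tj in zip(phi, t))

if __name__ == "__main__":
    for num_points in (4, 8, 16):
        print(num_points, ellipse_nystrom_max_error(num_points))
```

Calling `ellipse_nystrom_max_error(num_points, a=1.0, b=0.2)` shows the slower decay for the more eccentric ellipse, where the singularities of the kernel lie closer to the real axis.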

Note that for periodic analytic functions the rectangular rule generally
yields better approximations than Simpson’s rule (see Problem 9.12). □

We confine ourselves to these few examples for the application of the
Nyström method. For a greater variety the reader is referred to [1, 3, 6, 19,
25, 30, 39, 49].

With the aid of appropriately chosen quadrature formulae, which take
care of the singularity by a weighted product rule, the Nyström method
can also be successfully applied to weakly singular integral equations of the
second kind (see [39]).



12.4 The Collocation Method 

The collocation method for approximately solving an equation of the second
kind

$$\varphi - A\varphi = f \tag{12.22}$$

consists in seeking an approximate solution from a finite-dimensional
subspace by requiring that the equation (12.22) be satisfied at only a finite
number of so-called collocation points. Assume that $A : C[a,b] \to C[a,b]$
is a bounded linear operator and let $X_n = \operatorname{span}\{u_0^{(n)}, \ldots, u_n^{(n)}\} \subset C[a,b]$
denote a sequence of subspaces with $\dim X_n = n + 1$. Choose $n + 1$ points
$a \le x_0^{(n)} < \cdots < x_n^{(n)} \le b$ such that the interpolation at these grid
points with respect to the subspace $X_n$ is uniquely solvable. Typical
examples for the choice of $X_n$ are polynomials, trigonometric polynomials,
and splines (see also Problem 8.1). For convenience we will again write
$x_0, \ldots, x_n$ instead of $x_0^{(n)}, \ldots, x_n^{(n)}$, and $u_0, \ldots, u_n$ instead of
$u_0^{(n)}, \ldots, u_n^{(n)}$.

By $L_n : C[a,b] \to X_n$ we denote the operator that maps the function
$f \in C[a,b]$ into its uniquely determined interpolating function $L_n f \in X_n$
with the property

$$(L_n f)(x_j) = f(x_j), \quad j = 0, \ldots, n.$$

Representing $L_n$ in terms of the Lagrange basis, i.e., in terms of the uniquely
determined functions $\ell_0, \ldots, \ell_n \in X_n$ with the interpolation property

$$\ell_k(x_j) = \delta_{jk}, \quad j, k = 0, \ldots, n,$$

in the form

$$L_n f = \sum_{k=0}^n f(x_k)\,\ell_k \tag{12.23}$$

it can be seen that the operator $L_n : C[a,b] \to X_n$ is linear and bounded
(with respect to the maximum norm). Moreover, since $L_n f = f$ for all
$f \in X_n$, the interpolation operator is a projection operator; i.e., $L_n^2 = L_n$
(see p. 157 and Problem 8.4).

The collocation method approximates the solution of (12.22) by an
element $\varphi_n \in X_n$ satisfying

$$\varphi_n(x_j) - (A\varphi_n)(x_j) = f(x_j), \quad j = 0, \ldots, n. \tag{12.24}$$

We express $\varphi_n$ as a linear combination

$$\varphi_n = \sum_{k=0}^n \gamma_k u_k$$

and immediately see that equation (12.24) is equivalent to the linear system

$$\sum_{k=0}^n \gamma_k \{u_k(x_j) - (Au_k)(x_j)\} = f(x_j), \quad j = 0, \ldots, n, \tag{12.25}$$

for the coefficients $\gamma_0, \ldots, \gamma_n$. If we use the Lagrange basis for $X_n$ and write

$$\varphi_n = \sum_{k=0}^n \gamma_k \ell_k,$$

then of course $\gamma_j = \varphi_n(x_j)$, $j = 0, \ldots, n$, and the system (12.25) becomes

$$\gamma_j - \sum_{k=0}^n \gamma_k (A\ell_k)(x_j) = f(x_j), \quad j = 0, \ldots, n. \tag{12.26}$$

From the systems (12.25) and (12.26) it is obvious that the collocation
method is only semidiscrete, since in general, additional approximations
are needed in order to compute the matrix entries $(Au_k)(x_j)$ or $(A\ell_k)(x_j)$.

The collocation method can be interpreted as a projection method; i.e.,
since the interpolating function is uniquely determined by its values at the
interpolation points, equation (12.24) is equivalent to

$$\varphi_n - L_n A\varphi_n = L_n f. \tag{12.27}$$

This equation can be considered as an equation in the whole space $C[a,b]$
because any solution $\varphi_n = L_n A\varphi_n + L_n f$ automatically belongs to $X_n$.
Hence, our general error and convergence results for operator equations of
the second kind can be applied to the collocation method.

Theorem 12.16 Let $A : C[a,b] \to C[a,b]$ be a compact linear operator
such that $I - A$ is injective, and assume that the interpolation operators
$L_n : C[a,b] \to X_n$ satisfy $\|L_n A - A\|_\infty \to 0$, $n \to \infty$. Then, for
sufficiently large $n$, the approximate equation (12.27) is uniquely solvable for all
$f \in C[a,b]$, and we have the error estimate

$$\|\varphi_n - \varphi\|_\infty \le C\,\|L_n\varphi - \varphi\|_\infty \tag{12.28}$$

for some positive constant $C$ depending on $A$.

Proof. From Theorem 12.6 applied to $A_n = L_n A$, we conclude that for
all sufficiently large $n$ the inverse operators $(I - L_n A)^{-1}$ exist and are
uniformly bounded. To verify the error bound, we apply the interpolation
operator $L_n$ to (12.22) and get

$$\varphi - L_n A\varphi = L_n f + \varphi - L_n\varphi.$$

Subtracting this from (12.27) we find

$$(I - L_n A)(\varphi_n - \varphi) = L_n\varphi - \varphi,$$

whence the estimate (12.28) follows. □

Corollary 12.17 Let $A : C[a,b] \to C[a,b]$ be a compact linear operator
such that $I - A$ is injective, and assume that the interpolation operators
$L_n : C[a,b] \to X_n$ are pointwise convergent; i.e., $L_n\varphi \to \varphi$, $n \to \infty$, for all
$\varphi \in C[a,b]$. Then, for sufficiently large $n$, the approximate equation (12.27)
is uniquely solvable for all $f \in C[a,b]$, and the estimate (12.28) holds.

Proof. By Lemma 12.9 the pointwise convergence of the interpolation
operators $L_n$ and the compactness of $A$ imply that $\|L_n A - A\|_\infty \to 0$, $n \to \infty$.
Now the statement follows from the preceding theorem. □

We note that the collocation method may of course also be applied in
function spaces other than the space $C[a,b]$.

We proceed by considering the collocation method for integral equations
of the second kind

$$\varphi(x) - \int_a^b K(x,y)\varphi(y)\,dy = f(x), \quad x \in [a,b], \tag{12.29}$$

with continuous kernel $K$. Using the interpolation operator, in this case we
can rewrite the collocation equation (12.27) in the form

$$\varphi_n(x) - \int_a^b [L_n K(\cdot,y)](x)\,\varphi_n(y)\,dy = (L_n f)(x), \quad x \in [a,b], \tag{12.30}$$

and the systems (12.25) and (12.26) become

$$\sum_{k=0}^n \gamma_k \left\{ u_k(x_j) - \int_a^b K(x_j,y)u_k(y)\,dy \right\} = f(x_j), \quad j = 0, \ldots, n, \tag{12.31}$$

and

$$\gamma_j - \sum_{k=0}^n \gamma_k \int_a^b K(x_j,y)\ell_k(y)\,dy = f(x_j), \quad j = 0, \ldots, n, \tag{12.32}$$

respectively. There exists a broad variety of collocation methods
corresponding to various choices for the subspaces $X_n$, for the basis functions
$u_0, \ldots, u_n$, and for the collocation points $x_0, \ldots, x_n$. We briefly discuss two
possibilities, based on linear splines and on trigonometric polynomials.

First we consider piecewise linear interpolation. Let $x_j = a + jh$,
$j = 0, \ldots, n$, denote an equidistant subdivision with step size $h = (b - a)/n$
and let $X_n$ be the space of continuous functions on $[a,b]$ whose restrictions
on each of the subintervals $[x_{j-1}, x_j]$, $j = 1, \ldots, n$, coincide with a linear
function. As in Section 11.5, the Lagrange basis is given by

$$\ell_k(x) := \begin{cases} \dfrac{1}{h}\,(x - x_{k-1}), & x \in [x_{k-1}, x_k], \\[1ex] \dfrac{1}{h}\,(x_{k+1} - x), & x \in [x_k, x_{k+1}], \\[1ex] 0, & x \notin [x_{k-1}, x_{k+1}], \end{cases}$$

for $k = 0, \ldots, n$. Since for piecewise linear interpolation we have that

$$\|L_n f\|_\infty \le \max_{j=0,\ldots,n} |f(x_j)| \le \|f\|_\infty$$

with equality holding if $f$ is constant, we observe that $\|L_n\|_\infty = 1$ for the
corresponding interpolation operator $L_n$. Here, we have pointwise
convergence $L_n\varphi \to \varphi$, $n \to \infty$. This can be seen from the error estimate (8.9)
and the Weierstrass approximation theorem, analogous to the proof of the
Szegő Theorem 9.10. Therefore, in this case Corollary 12.17 applies, and
we can state the following result.

Theorem 12.18 The collocation method with linear splines converges for
integral equations of the second kind with continuous kernels.

Provided that the exact solution of the integral equation is twice
continuously differentiable, then from the error estimate (8.9) for linear
interpolation and Corollary 12.17 we derive an error estimate of the form

$$\|\varphi_n - \varphi\|_\infty \le C\,\|\varphi''\|_\infty\,h^2$$

for the linear spline collocation approximate solution $\varphi_n$. Here, $C$ denotes
some constant depending on the kernel $K$.

In most practical problems the evaluation of the matrix entries
in (12.32) will require a numerical quadrature for integrals of the form
$\int_a^b K(x_j,y)\ell_k(y)\,dy$. To be consistent with our approximations, we replace
$K(x_j,\cdot)$ by its piecewise linear interpolation; i.e., we approximate

$$\int_a^b K(x_j,y)\ell_k(y)\,dy \approx \sum_{i=0}^n K(x_j,x_i) \int_a^b \ell_i(y)\ell_k(y)\,dy$$

for $j, k = 0, \ldots, n$. Straightforward calculations yield the tridiagonal matrix

$$W = \frac{h}{6} \begin{pmatrix} 2 & 1 & & & \\ 1 & 4 & 1 & & \\ & 1 & 4 & 1 & \\ & & \ddots & \ddots & \ddots \\ & & 1 & 4 & 1 \\ & & & 1 & 2 \end{pmatrix}$$

for the weights $w_{ik} = \int_a^b \ell_i(y)\ell_k(y)\,dy$.

We now investigate the influence of these approximations on the error
analysis. We interpret the solution of the system (12.32) with the
approximate values for the coefficients as the solution $\tilde\varphi_n$ of an additional
approximate equation

$$\tilde\varphi_n - \tilde A_n\tilde\varphi_n = L_n f, \tag{12.33}$$

namely of the collocation equation

$$\tilde\varphi_n(x) - \int_a^b [L_n K_n(\cdot,y)](x)\,\tilde\varphi_n(y)\,dy = (L_n f)(x), \quad a \le x \le b,$$

with

$$K_n(x,y) := \sum_{i=0}^n K(x,x_i)\,\ell_i(y);$$

i.e., $K_n(x,y) = [L_n K(x,\cdot)](y)$ interpolates $K$ with respect to the second
variable. We assume that the kernel $K$ is twice continuously differentiable
on $[a,b] \times [a,b]$. Then, using the error estimate (8.9), we have

$$|K(x,y) - K_n(x,y)| \le \frac{h^2}{8} \left\| \frac{\partial^2 K}{\partial y^2} \right\|_\infty$$

for all $a \le x, y \le b$. Writing

$$K_n(x,y) - [L_n K_n(\cdot,y)](x) = L_n\{K(x,\cdot) - [L_n K(\cdot,\cdot)](x)\}(y)$$

and using the fact that for the piecewise linear spline interpolation we have
$\|L_n\|_\infty = 1$, from (8.9) we obtain

$$|K_n(x,y) - [L_n K_n(\cdot,y)](x)| \le \frac{h^2}{8} \left\| \frac{\partial^2 K}{\partial x^2} \right\|_\infty$$

for all $a \le x, y \le b$. Hence, in view of (12.4), for the integral operator $\tilde A_n$
with kernel $[L_n K_n(\cdot,y)](x)$ we have $\|\tilde A_n - A\|_\infty = O(h^2)$. When $f$ is twice
continuously differentiable, we also have $\|L_n f - f\|_\infty = O(h^2)$. Therefore, from
Theorem 12.6 we can now conclude that the approximate equation (12.33) is
uniquely solvable for sufficiently large $n$ and that for the approximate solution
we have an error estimate $\|\tilde\varphi_n - \varphi\|_\infty = O(h^2)$. Therefore, the fully discrete
approximation still is of order $O(h^2)$.

Example 12.19 Consider the integral equation (12.19) of Example 12.14.
Table 12.4 gives the error between the exact solution and the fully discrete
collocation approximation with linear splines. It clearly exhibits the error
behavior $O(h^2)$. □



TABLE 12.4. Numerical results for spline collocation

   n    x = 0      x = 0.25   x = 0.5    x = 0.75   x = 1
   4    0.004808   0.005430   0.006178   0.007128   0.008331
   8    0.001199   0.001354   0.001541   0.001778   0.002078
  16    0.000300   0.000338   0.000385   0.000444   0.000519
  32    0.000075   0.000085   0.000096   0.000111   0.000130
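
A minimal sketch of the fully discrete linear spline collocation for equation (12.19) follows. It assumes the Lagrange (hat function) basis on the equidistant grid $x_j = j/n$ and replaces the matrix entries of (12.32) by $\sum_i K(x_j,x_i)\,w_{ik}$, where $w_{ik}$ are the hat function integrals ($2h/3$ on the interior diagonal, $h/3$ at the two corners, $h/6$ next to the diagonal). All function names are illustrative, and the dense solver is only a stand-in.

```python
import math

def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting for small dense systems."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x

def kernel(x, y):
    # Kernel of (12.19), including the factor 1/2.
    return 0.5 * (x + 1.0) * math.exp(-x * y)

def right_hand_side(x):
    return math.exp(-x) - 0.5 + 0.5 * math.exp(-(x + 1.0))

def hat_weights(n):
    """Tridiagonal matrix of the integrals of products of two hat functions."""
    h = 1.0 / n
    w = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        w[i][i] = (2.0 if i in (0, n) else 4.0) * h / 6.0
        if i > 0:
            w[i][i - 1] = h / 6.0
        if i < n:
            w[i][i + 1] = h / 6.0
    return w

def spline_collocation_max_error(n):
    """Largest nodal error of the fully discrete collocation solution of (12.19)."""
    x = [j / n for j in range(n + 1)]
    w = hat_weights(n)
    k_values = [[kernel(xj, xi) for xi in x] for xj in x]
    system = []
    for j in range(n + 1):
        row = []
        for k in range(n + 1):
            entry = sum(k_values[j][i] * w[i][k] for i in range(n + 1))
            row.append((1.0 if j == k else 0.0) - entry)
        system.append(row)
    gamma = solve(system, [right_hand_side(xj) for xj in x])
    return max(abs(g - math.exp(-xj)) for g, xj in zip(gamma, x))

if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, spline_collocation_max_error(n))
```

The errors should drop by a factor of about 4 when $n$ is doubled, consistent with the order $O(h^2)$ of the fully discrete scheme.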



We note that in principle, a collocation method with error $O(h^4)$ can
be obtained from cubic spline interpolation (see Theorem 8.34). However,
the numerical implementation is much more involved. This again illustrates
that the Nyström method is more practical, since there it is quite easy to
change the order from $O(h^2)$ to $O(h^4)$ by simply replacing the weights of
the trapezoidal rule by those of Simpson’s rule.

We proceed by discussing the collocation method based on trigonometric
interpolation with equidistant knots $t_j = j\pi/n$, $j = 0, \ldots, 2n - 1$. First,
we establish a convergence result for the trigonometric interpolation of
differentiable functions (see Problem 8.12).

Lemma 12.20 Let $f \in C^1[0, 2\pi]$. Then for the remainder in trigonometric
interpolation we have

$$\|L_n f - f\|_\infty \le c_n \|f'\|_2, \tag{12.34}$$

where $c_n \to 0$, $n \to \infty$.

Proof. Consider the trigonometric monomials $f_m(t) = e^{imt}$ and write
$m = (2k + 1)n + q$ with $k \in \mathbb{Z}$ and $0 \le q < 2n$. Since $f_m(t_j) = f_{q-n}(t_j)$ for
$j = 0, \ldots, 2n - 1$, the trigonometric interpolation polynomials for $f_m$ and
$f_{q-n}$ coincide. Therefore, we have

$$\|L_n f_m - f_m\|_\infty \le 2$$

for all $|m| \ge n$. Since $f$ is continuously differentiable, we can expand it into
a uniformly convergent Fourier series (see Problem 12.14)

$$f = \sum_{m=-\infty}^\infty a_m f_m.$$

From the relation

$$\int_0^{2\pi} f'(t)e^{-imt}\,dt = im \int_0^{2\pi} f(t)e^{-imt}\,dt = 2\pi i m\,a_m$$

for the Fourier coefficients it follows that

$$\|f'\|_2^2 = \int_0^{2\pi} |f'(t)|^2\,dt = 2\pi \sum_{m=-\infty}^\infty m^2 |a_m|^2.$$

Using this identity and the Cauchy-Schwarz inequality, we derive

$$\|L_n f - f\|_\infty^2 \le 4 \Bigl( \sum_{|m|=n}^\infty |a_m| \Bigr)^2 \le 4 \sum_{|m|=n}^\infty \frac{1}{m^2} \sum_{|m|=n}^\infty m^2 |a_m|^2 \le \frac{4}{\pi}\,\|f'\|_2^2 \sum_{m=n}^\infty \frac{1}{m^2}.$$

This implies (12.34). □



Now, consider an integral equation of the second kind with $2\pi$-periodic
continuously differentiable kernel $K$ and right-hand side $f$. The
corresponding integral operator $A$ maps $C[0, 2\pi]$ into $C^1[0, 2\pi]$ and satisfies
$\|(A\varphi)'\|_2 \le M\|\varphi\|_\infty$, where $M = (2\pi)^{3/2}\,\|\partial K/\partial t\|_\infty$ (a factor $2\pi$ from
the integration over $\tau$ and a factor $\sqrt{2\pi}$ from the $L^2$ norm). Therefore, making
use of (12.34), we find

$$\|L_n A\varphi - A\varphi\|_\infty \le c_n \|(A\varphi)'\|_2 \le c_n M \|\varphi\|_\infty$$

for all $\varphi \in C[0, 2\pi]$. Hence, $\|L_n A - A\|_\infty \le c_n M \to 0$, $n \to \infty$, and Theorem
12.16 can be applied to obtain the following result.

Theorem 12.21 The collocation method with trigonometric polynomials
converges for integral equations of the second kind with continuously
differentiable periodic kernels and right-hand sides.

One possibility for the implementation of the collocation method is to
use the trigonometric monomials as basis functions. Then the integrals
$\int_0^{2\pi} K(t_j,\tau)e^{ik\tau}\,d\tau$ have to be integrated numerically. Replacing the kernel
by its trigonometric interpolation leads to the quadrature formula

$$\int_0^{2\pi} K(t_j,\tau)e^{ik\tau}\,d\tau \approx \frac{\pi}{n} \sum_{m=0}^{2n-1} K(t_j,t_m)\,e^{ikt_m}$$

for $j = 0, \ldots, 2n - 1$. Using fast Fourier transform techniques (see Section
8.2) these quadratures can be carried out very rapidly. A second, even
more efficient, possibility is to use the Lagrange basis

$$\ell_k(t) = \frac{1}{2n} \left\{ 1 + 2\sum_{m=1}^{n-1} \cos m(t - t_k) + \cos n(t - t_k) \right\} \tag{12.35}$$

for $k = 0, \ldots, 2n - 1$, which can be derived from Theorem 8.25 (see Problem
12.13).

For the evaluation of the matrix coefficients $\int_0^{2\pi} K(t_j,\tau)\ell_k(\tau)\,d\tau$ we
proceed analogously to the preceding case of linear splines. We approximate
these integrals by replacing $K(t_j,\cdot)$ by its trigonometric interpolation
polynomial, i.e., we approximate

$$\int_0^{2\pi} K(t_j,\tau)\ell_k(\tau)\,d\tau \approx \sum_{m=0}^{2n-1} K(t_j,t_m) \int_0^{2\pi} \ell_m(\tau)\ell_k(\tau)\,d\tau$$

for $j, k = 0, \ldots, 2n - 1$. Using (12.35), elementary integrations yield (see
Problem 12.13)

$$\int_0^{2\pi} \ell_m(\tau)\ell_k(\tau)\,d\tau = \frac{\pi}{n}\,\delta_{mk} - (-1)^{m-k}\,\frac{\pi}{4n^2} \tag{12.36}$$

for $m, k = 0, \ldots, 2n - 1$. Note that despite the global nature of the
trigonometric interpolation and its Lagrange basis, due to the simple structure of
the weights (12.36) in the quadrature rule, the computation of the matrix
elements is not too costly. The only additional computational effort besides
the kernel evaluation is the computation of the row sums

$$\sum_{m=0}^{2n-1} (-1)^m K(t_j,t_m)$$

for $j = 0, \ldots, 2n - 1$. We omit the analysis of the additional error in the
fully discrete method caused by the numerical quadrature.
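
The weights (12.36) can be checked numerically with a small sketch. It evaluates the Lagrange basis (12.35) directly and integrates the product $\ell_m\ell_k$ with the rectangular rule on $4n$ points, which is exact here because the product is a trigonometric polynomial of degree at most $2n$. The function names are illustrative and not part of the text.

```python
import math

def lagrange_basis(k, t, n):
    """Trigonometric Lagrange basis function (12.35) for the knot t_k = k pi / n."""
    d = t - k * math.pi / n
    inner = sum(math.cos(m * d) for m in range(1, n))
    return (1.0 + 2.0 * inner + math.cos(n * d)) / (2.0 * n)

def weight_by_quadrature(m, k, n):
    """Integral of l_m l_k over [0, 2 pi] by the rectangular rule on 4n points."""
    num_points = 4 * n
    step = 2.0 * math.pi / num_points
    return step * sum(
        lagrange_basis(m, i * step, n) * lagrange_basis(k, i * step, n)
        for i in range(num_points)
    )

def weight_closed_form(m, k, n):
    """Right-hand side of (12.36)."""
    delta = math.pi / n if m == k else 0.0
    return delta - (-1) ** abs(m - k) * math.pi / (4.0 * n * n)

def largest_deviation(n):
    """Maximum difference between quadrature and closed form over all m, k."""
    return max(
        abs(weight_by_quadrature(m, k, n) - weight_closed_form(m, k, n))
        for m in range(2 * n) for k in range(2 * n)
    )

if __name__ == "__main__":
    for n in (2, 3, 5):
        print(n, largest_deviation(n))
```

The deviations are at the level of rounding errors, which supports the closed form; since the second term of (12.36) depends on $m$ and $k$ only through the sign $(-1)^{m-k}$, its contribution to all matrix entries of a row reduces to a single alternating row sum.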

Example 12.22 For the integral equation (12.20) from Example 12.15,
Table 12.5 gives the error between the exact solution and the collocation
approximation.

TABLE 12.5. Collocation method for equation (12.20)

  a    b      n    t = 0          t = π/2        t = π
  1    0.5    4    -0.10752855    -0.03243176     0.03961310
              8    -0.00231537     0.00059809     0.00045961
             16    -0.00000044     0.00000002    -0.00000000
  1    0.2    4    -0.56984945    -0.18357135     0.06022598
              8    -0.14414257    -0.00368787    -0.00571394
             16    -0.00602543    -0.00035953    -0.00045408
             32    -0.00000919    -0.00000055    -0.00000069

Again we have exponential convergence, as is to be expected from the
estimate (12.28) and the error analysis for the trigonometric interpolation
for analytic functions [38]. □

In general, the fully discrete implementation of the collocation method 
as described by our two examples can be used in all situations where the re- 
quired numerical quadratures for the matrix elements can be carried out in 
closed form for the chosen approximating subspace and collocation points. 
In all these cases, of course, the quadrature formulae that are required for 
the related Nystrom method will also be available. Because the approxima- 
tion order for both methods usually will be the same, Nystrom’s method 
is preferable, since it requires the least computational effort for evaluat- 
ing the matrix elements. However, the situation changes in cases where no 
straightforward quadrature rules for the application of Nystrom’s method 
are available. 

Again, for a greater variety of collocation methods the reader is referred 
to [1, 3, 6, 19, 25, 30, 39, 49]. 



12.5 Stability 

For finite-dimensional approximations of a given operator equation we have 
to distinguish three condition numbers, namely, the condition numbers of 
the original operator and of the approximating operator as mappings in 
the underlying normed spaces, and the condition number of the linear sys- 
tem for the actual numerical solution. This latter system we can influence, 
for example in the collocation method by the choice of the basis for the 
approximating subspaces. 

Consider an equation of the second kind $\varphi - A\varphi = f$ in a Banach space
$X$ and approximating equations $\varphi_n - A_n\varphi_n = f_n$ under the assumptions of
Theorem 12.6, i.e., norm convergence, or of Theorem 12.10, i.e., collective
compactness and pointwise convergence. Then, recalling Definition 5.2 of
the condition number, from Theorems 12.6 and 12.10 it follows that the
condition numbers $\operatorname{cond}(I - A_n)$ are uniformly bounded. Hence, for the
condition of the approximating scheme, we mainly have to be concerned
with the condition of the linear system for the actual computation of the
solution of $\varphi_n - A_n\varphi_n = f_n$.

For the discussion of the condition number for the Nyström method we
recall the linear system (12.14) and denote by $\tilde A_n$ the matrix with the
entries $a_k K(x_j, x_k)$. We introduce operators $R_n : C[a,b] \to \mathbb{R}^{n+1}$ by

$$R_n f := (f(x_0), \ldots, f(x_n))^T, \quad f \in C[a,b],$$

and $M_n : \mathbb{R}^{n+1} \to C[a,b]$, where $M_n\Phi$ is the piecewise linear interpolation
with $(M_n\Phi)(x_j) = \Phi_j$, $j = 0, \ldots, n$, for $\Phi = (\Phi_0, \ldots, \Phi_n)^T$. (If $a < x_0$, we
set $(M_n\Phi)(x) = \Phi_0$ for $a \le x \le x_0$, and if $x_n < b$, we set $(M_n\Phi)(x) = \Phi_n$
for $x_n \le x \le b$.) Then clearly, $\|R_n\|_\infty = \|M_n\|_\infty = 1$.

From Theorem 12.11 we conclude that

$$(I - \tilde A_n) = R_n (I - A_n) M_n$$

and

$$(I - \tilde A_n)^{-1} = R_n (I - A_n)^{-1} M_n.$$

From these relations we immediately obtain the following theorem.

Theorem 12.23 For the Nyström method the condition numbers for the
linear system are uniformly bounded.

This theorem states that the Nyström method essentially preserves the
stability of the original integral equation.

For the collocation method, we introduce the matrices $E_n$ with entries
$u_k(x_j)$ and $\tilde A_n$ with entries $(Au_k)(x_j)$. Since $X_n = \operatorname{span}\{u_0, \ldots, u_n\}$ is
assumed to be such that the interpolation problem with respect to the
collocation points $x_0, \ldots, x_n$ is uniquely solvable, the matrix $E_n$ is invertible
(see Problem 8.1). In addition, let the operator $W_n : \mathbb{R}^{n+1} \to C[a,b]$ be
defined by

$$W_n : \gamma \mapsto \sum_{k=0}^n \gamma_k u_k$$

for $\gamma = (\gamma_0, \ldots, \gamma_n)^T$ and recall the operators $R_n$ and $M_n$ from above.
Then we have

$$W_n = L_n M_n E_n.$$

From (12.25) we can conclude that

$$(E_n - \tilde A_n) = R_n L_n (I - A) W_n$$

and

$$(E_n - \tilde A_n)^{-1} = E_n^{-1} R_n (I - L_n A)^{-1} L_n M_n.$$

From these three relations, and the fact that by Theorems 12.7 and 12.16
the sequence of operators $(I - L_n A)^{-1} L_n$ is uniformly bounded, we obtain
the following theorem.

Theorem 12.24 Under the assumptions of Theorem 12.16, for the
collocation method the condition number of the linear system satisfies

$$\operatorname{cond}(E_n - \tilde A_n) \le C\,\|L_n\|_\infty^2 \operatorname{cond} E_n$$

for all sufficiently large $n$ and some constant $C$.

This theorem suggests that the basis functions must be chosen with
caution. For a poor choice, like monomials, the condition number of $E_n$ can
grow quite rapidly. However, for the Lagrange basis, i.e., for the linear
system (12.26), $E_n$ becomes the identity matrix with condition number one.
In addition, $\|L_n\|$ enters in the estimate on the condition number of the
linear system, and for example, for polynomial or trigonometric polynomial
interpolation we have $\|L_n\| \to \infty$, $n \to \infty$ (see Theorem 8.16).
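
The growth of $\operatorname{cond} E_n$ for the monomial basis can be made concrete with a small sketch. It assumes the basis $u_k(x) = x^k$ and equidistant collocation points $x_j = j/n$ on $[0,1]$, so that $E_n$ is a Vandermonde matrix, and it measures the condition number in the maximum (row sum) norm. Exact rational arithmetic is used so that the ill-conditioning does not contaminate the computation itself; the function names are illustrative.

```python
from fractions import Fraction

def inverse(matrix):
    """Exact Gauss-Jordan inversion of a square matrix of Fractions."""
    n = len(matrix)
    aug = [list(row) + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for row in range(n):
            if row != col and aug[row][col] != 0:
                factor = aug[row][col]
                aug[row] = [vr - factor * vc for vr, vc in zip(aug[row], aug[col])]
    return [row[n:] for row in aug]

def row_sum_norm(matrix):
    """Matrix norm induced by the maximum norm."""
    return max(sum(abs(v) for v in row) for row in matrix)

def cond_monomial_basis(n):
    """cond(E_n) for u_k(x) = x**k at the points x_j = j/n, j = 0..n."""
    points = [Fraction(j, n) for j in range(n + 1)]
    e_n = [[xj ** k for k in range(n + 1)] for xj in points]
    return float(row_sum_norm(e_n) * row_sum_norm(inverse(e_n)))

if __name__ == "__main__":
    for n in (2, 4, 8, 12):
        print(n, cond_monomial_basis(n))
```

For the Lagrange basis the same computation is trivial, since $E_n$ is the identity matrix with condition number one; the monomial values, by contrast, grow by several orders of magnitude between $n = 4$ and $n = 8$.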

In the context of stability we will conclude this chapter with a few re- 
marks on integral equations of the first kind. 

Theorem 12.25 Let $X$ and $Y$ be normed spaces and let $A : X \to Y$ be a
compact linear operator. Then $A$ has a bounded inverse if and only if $X$ is
finite-dimensional.

Proof. Assume that $A$ has a bounded inverse $A^{-1} : Y \to X$. Then we have
$A^{-1}A = I$, and therefore the identity operator must be compact, since the
product of a bounded and a compact operator is compact (see Problem
12.2). However, the identity operator on $X$ is compact if and only if $X$ has
finite dimension. □

Theorem 12.25 implies that integral equations of the first kind with con- 
tinuous (or weakly singular) kernels are improperly posed problems in the 
sense of Hadamard, as described in Chapter 5. 

Of course, the ill-posed nature of an equation has consequences for its 
numerical treatment. The fact that an operator does not have a bounded 
inverse means that the condition numbers of its finite-dimensional approx- 
imations grow with the quality of the approximation. Hence, a careless dis- 
cretization of ill-posed problems leads to a numerical behavior that at first 
glance seems to be paradoxical. Namely, increasing the degree of discretiza- 
tion, i.e., increasing the accuracy of the approximation for the operator, will 
cause the approximate solution to the equation to become less and less re- 
liable. Therefore, straightforward application of the methods described in 
this chapter to integral equations of the first kind with continuous kernels 
will generate numerical nonsense. 

To make this remark more vivid, we consider the approximate solution
of an integral equation of the first kind

$$\int_a^b K(x,y)\varphi(y)\,dy = f(x), \quad x \in [a,b],$$

by the analogue of the linear system (12.14) for the Nyström method, i.e.,
by

$$\sum_{k=0}^n a_k K(x_j,x_k)\,\varphi_k^{(n)} = f(x_j), \quad j = 0, \ldots, n.$$

The equation of the first kind

$$\int_0^1 (x+1)e^{-xy}\varphi(y)\,dy = 1 - e^{-(x+1)}, \quad 0 \le x \le 1, \tag{12.37}$$

has the unique solution $\varphi(x) = e^{-x}$ (see Problem 12.20). Table 12.6 gives
the difference between the exact solution and the solution obtained by the
quadrature method using the (composite) trapezoidal rule.



TABLE 12.6. Numerical solution of (12.37)

   n    x = 0       x = 0.5      x = 1
   4      0.4057      0.3705       0.1704
   8     -4.5989     14.6094      -4.4770
  16     -8.5957      2.2626    -153.4805
  32      3.8965    -32.2907      22.5570
  64    -88.6474     -6.4484    -182.6745




We observe that the approximation is completely useless and that in 
agreement with the above remarks, the quality of the approximation de- 
creases when the accuracy of the quadrature is increased. (Of course, the 
actual numerical values for the solution of the ill-conditioned linear system 
of this example will depend on the actual computer and the code for solving 
the linear system that is used.) 
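
The experiment can be repeated with a minimal sketch that discretizes (12.37) by the composite trapezoidal rule on the points $x_k = k/n$ and solves the resulting system directly. As noted above, the computed numbers depend on the machine and on the solver, so the output need not reproduce Table 12.6 digit for digit; the function names are illustrative, and only small $n$ are used because for larger $n$ the system becomes numerically singular.

```python
import math

def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting for small dense systems."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x

def first_kind_max_error(n):
    """Largest nodal error when (12.37) is discretized by the trapezoidal rule."""
    h = 1.0 / n
    x = [k * h for k in range(n + 1)]
    a = [h] * (n + 1)
    a[0] = a[n] = h / 2.0
    # Matrix entries a_k K(x_j, x_k) with K(x, y) = (x + 1) exp(-x y); no identity term.
    system = [[a[k] * (x[j] + 1.0) * math.exp(-x[j] * x[k])
               for k in range(n + 1)] for j in range(n + 1)]
    rhs = [1.0 - math.exp(-(xj + 1.0)) for xj in x]
    phi = solve(system, rhs)
    return max(abs(p - math.exp(-xj)) for p, xj in zip(phi, x))

if __name__ == "__main__":
    for n in (4, 8):
        print(n, first_kind_max_error(n))
```

In contrast to the second-kind equation (12.19), where the same quadrature gives errors of order $h^2$, the errors here are of the size of the solution itself or larger and do not improve under refinement.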

Hence, the numerical solution of integral equations of the first kind with 
continuous kernels requires regularization methods such as Tikhonov reg- 
ularization or singular value cutoff, which we discussed in Chapter 5 for 
the finite-dimensional case. These regularization techniques now, of course, 
need to be analyzed in an appropriate function space setting. We recall the 
corresponding references to [14, 22, 28, 37, 39, 43] from Chapter 5 for the 
foundation of regularization methods in Hilbert spaces. 



Problems 

12.1 Show that the boundary value problem for the differential equation

$$-u'' + qu = r \quad \text{in } [0,1]$$

with boundary conditions $u(0) = u(1) = 0$ is equivalent to finding a continuous
solution of the integral equation of the second kind

$$u(x) + \int_0^1 G(x,y)q(y)u(y)\,dy = \int_0^1 G(x,y)r(y)\,dy, \quad 0 \le x \le 1,$$

where

$$G(x,y) := \begin{cases} (1 - x)\,y, & 0 \le y \le x \le 1, \\ (1 - y)\,x, & 0 \le x \le y \le 1, \end{cases}$$

is the so-called Green’s function of the boundary value problem.

12.2 Show that linear combinations of compact linear operators are compact 
and that the product of two bounded linear operators is compact if one of the 
factors is compact. 

12.3 Show that the integral operator with continuous kernel is a compact
operator from $L^2[a,b]$ into $L^2[a,b]$.

12.4 Show that the Volterra integral equation of the second kind

$$\varphi(x) - \int_a^x K(x,y)\varphi(y)\,dy = f(x), \quad x \in [a,b],$$

with continuous kernel $K$ has a unique continuous solution $\varphi$ for each continuous
right-hand side $f$.

Hint: Show that the homogeneous equation allows only the trivial solution and
use Theorem 12.2.

12.5 Solve the Volterra integral equation

$$\varphi(x) - \int_0^x e^{x-y}\varphi(y)\,dy = f(x)$$

by successive approximations.

12.6 Show that a sequence $A_n : X \to Y$ of compact linear operators mapping
a normed space $X$ into a normed space $Y$ is collectively compact if and only if
for each bounded sequence $(\varphi_n)$ in $X$ the sequence $(A_n\varphi_n)$ contains a convergent
subsequence.

12.7 Show that a sequence $(\varphi_n)$ of functions $\varphi_n : [a,b] \to \mathbb{R}$ that is
equicontinuous and converges pointwise on $[a,b]$ to some function $\varphi : [a,b] \to \mathbb{R}$ converges
uniformly on $[a,b]$.

12.8 Prove the Banach-Steinhaus theorem: Let $A : X \to Y$ be a bounded linear
operator and let $A_n : X \to Y$ be a sequence of bounded linear operators from a
Banach space $X$ into a normed space $Y$. For pointwise convergence $A_n\varphi \to A\varphi$,
$n \to \infty$, for all $\varphi \in X$ it is necessary and sufficient that $\|A_n\| \le C$ for all $n \in \mathbb{N}$
with some constant $C$ and that $A_n\varphi \to A\varphi$, $n \to \infty$, for all $\varphi \in U$, where $U$ is
some dense subset of $X$ (compare Theorem 9.10).

12.9 For the integral operator $A$ and the numerical integration operators $A_n$ using
the (composite) trapezoidal rule, derive bounds on $\|(A_n - A)A\|_\infty$ and
$\|(A_n - A)A_n\|_\infty$. Relate the results to Lemma 12.9.



12.10 Verify the integrals (12.21).

12.11 Write a computer program for the Nyström method allowing the use of
different quadrature formulae and test it for various examples.

12.12 Use the quadrature formula (9.36) with the substitution (9.47) in a
Nyström method for the integral equation (12.19). Compare the numerical
results with those obtained from the trapezoidal and Simpson’s rule.

12.13 Verify the Lagrange basis (12.35) and the integrals (12.36). 

12.14 Show that the Fourier series of a continuously differentiable periodic func- 
tion is uniformly convergent. 

12.15 In the degenerate kernel approximation the integral equation of the
second kind with continuous kernel $K$ is approximated by the solutions of

$$\varphi_n(x) - \int_a^b K_n(x,y)\varphi_n(y)\,dy = f(x), \quad x \in [a,b],$$

with an approximate kernel $K_n$ of the form

$$K_n(x,y) = \sum_{j=0}^n a_j(x)\,b_j(y).$$

Show how the solution of the approximate equation can be reduced to solving
a system of linear equations. Give an error and convergence analysis based on
Theorem 12.6.

12.16 Use the results of Problem 12.15 to prove Theorem 12.2 for the case of 
an integral equation of the second kind with continuous kernel. 

12.17 Construct degenerate kernels via interpolation of the kernel $K$ with
respect to the first variable and relate this particular degenerate kernel method to
the collocation method (see Problem 12.15).

12.18 The idea of two-grid and multigrid iterations can also be applied to
integral equations of the second kind. For its theoretical foundation assume
the sequence of operators $A_n : X \to X$ to be either norm convergent (i.e.,
$\|A_n - A\| \to 0$, $n \to \infty$) or collectively compact and pointwise convergent (i.e.,
$A_n\varphi \to A\varphi$, $n \to \infty$, for all $\varphi \in X$). Show that the defect correction iteration

$$\varphi_{n,\nu+1} := (I - A_{n-1})^{-1}\{(A_n - A_{n-1})\varphi_{n,\nu} + f_n\}, \quad \nu = 0, 1, 2, \ldots,$$

using the preceding coarser level converges, provided that $n$ is sufficiently large.
Show that the defect correction iteration

$$\varphi_{n,\nu+1} := (I - A_0)^{-1}\{(A_n - A_0)\varphi_{n,\nu} + f_n\}, \quad \nu = 0, 1, 2, \ldots,$$

using the coarsest level converges, provided that the approximation $A_0$ is
sufficiently close to $A$.
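
12.19 Consider the two-grid iteration

$$\varphi_{n,\nu+1} := (I - A_m)^{-1}\{(A_n - A_m)\varphi_{n,\nu} + f_n\}, \quad \nu = 0, 1, 2, \ldots,$$

with $m = n - 1$ or $m = 0$ for the Nyström method, i.e., for the numerical
quadrature operators

$$(A_n\varphi)(x) = \sum_{k=1}^{z_n} \alpha_k^{(n)} K(x, x_k^{(n)})\,\varphi(x_k^{(n)}), \quad x \in [0,1],$$

with $z_n$ quadrature points. Show that each iteration step requires the following
computations. First

$$g_{n,\nu} := f_n + (A_n - A_m)\varphi_{n,\nu}$$

has to be evaluated at the $z_m$ quadrature points $x_j^{(m)}$, $j = 1, \ldots, z_m$, on the level
$m$ and at the $z_n$ quadrature points $x_j^{(n)}$, $j = 1, \ldots, z_n$, on the level $n$ by setting
$x = x_j^{(m)}$ and $x = x_j^{(n)}$, respectively, in

$$g_{n,\nu}(x) = f_n(x) + \sum_{k=1}^{z_n} \alpha_k^{(n)} K(x, x_k^{(n)})\,\varphi_{n,\nu}(x_k^{(n)}) - \sum_{k=1}^{z_m} \alpha_k^{(m)} K(x, x_k^{(m)})\,\varphi_{n,\nu}(x_k^{(m)}).$$

Then one has to solve the linear system

$$\varphi_{n,\nu+1}(x_j^{(m)}) - \sum_{k=1}^{z_m} \alpha_k^{(m)} K(x_j^{(m)}, x_k^{(m)})\,\varphi_{n,\nu+1}(x_k^{(m)}) = g_{n,\nu}(x_j^{(m)}), \quad j = 1, \ldots, z_m,$$

for the values $\varphi_{n,\nu+1}(x_j^{(m)})$ at the quadrature points $x_j^{(m)}$. Finally, the values
at the $z_n$ quadrature points $x_j^{(n)}$, $j = 1, \ldots, z_n$, are obtained from the Nyström
interpolation

$$\varphi_{n,\nu+1}(x_j^{(n)}) = \sum_{k=1}^{z_m} \alpha_k^{(m)} K(x_j^{(n)}, x_k^{(m)})\,\varphi_{n,\nu+1}(x_k^{(m)}) + g_{n,\nu}(x_j^{(n)}), \quad j = 1, \ldots, z_n.$$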

Make an operation count for one step of the defect correction iteration. Set up 
the corresponding equations for the collocation method. 
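
The three computations of one iteration step (evaluate the right-hand side on both levels, solve on the coarse level, interpolate to the fine level) can be sketched in code as follows. This is only an illustration under stated assumptions: the composite trapezoidal rule on [0, 1] is used on both levels, the starting iterate is zero, and the function names and the test equation with exact solution $e^{-x}$ are choices made here, not taken from the text.

```python
import math

def trapezoidal(n):
    """Nodes and weights of the composite trapezoidal rule on [0, 1]."""
    h = 1.0 / n
    nodes = [k * h for k in range(n + 1)]
    weights = [h / 2 if k in (0, n) else h for k in range(n + 1)]
    return nodes, weights

def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting; inputs are left unchanged."""
    n = len(rhs)
    a = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(a[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (a[r][n] - s) / a[r][r]
    return x

def apply_quadrature(kernel, nodes, weights, values, x):
    """Quadrature operator: sum_k alpha_k K(x, x_k) phi(x_k)."""
    return sum(w * kernel(x, y) * v for w, y, v in zip(weights, nodes, values))

def two_grid(kernel, f, fine, coarse, steps):
    """Defect correction with coarse level m for the fine Nystrom equation."""
    xn, an = fine
    xm, am = coarse
    # Coarse matrix I - A_m, assembled once and reused in every step.
    coarse_matrix = [[(1.0 if i == k else 0.0) - am[k] * kernel(xm[i], xm[k])
                      for k in range(len(xm))] for i in range(len(xm))]
    phi_n = [0.0] * len(xn)  # current iterate at the fine points
    phi_m = [0.0] * len(xm)  # current iterate at the coarse points
    for _ in range(steps):
        def g(x):
            # g = f + (A_n - A_m) phi, built from the current iterate
            return (f(x) + apply_quadrature(kernel, xn, an, phi_n, x)
                    - apply_quadrature(kernel, xm, am, phi_m, x))
        g_m = [g(x) for x in xm]
        g_n = [g(x) for x in xn]
        phi_m = solve(coarse_matrix, g_m)  # coarse-level linear system
        # Nystrom interpolation from the coarse values to the fine points
        phi_n = [apply_quadrature(kernel, xm, am, phi_m, x) + gx
                 for x, gx in zip(xn, g_n)]
    return phi_n

# Illustrative test equation (an assumption of this sketch), solution exp(-x).
kernel = lambda x, y: 0.5 * (x + 1.0) * math.exp(-x * y)
f = lambda x: math.exp(-x) - 0.5 + 0.5 * math.exp(-(x + 1.0))

fine, coarse = trapezoidal(16), trapezoidal(8)
phi = two_grid(kernel, f, fine, coarse, steps=10)
print(max(abs(p - math.exp(-x)) for x, p in zip(fine[0], phi)))
```

Only the small coarse system is solved in each step, while the fine level enters through evaluations of the quadrature sums. This is the structure on which the requested operation count can be based.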



12.20 Show that the integral equation (12.37) has a unique solution.

Cauchy-Schwarz inequality, 30
Céa’s lemma, 273
characteristic polynomial, 36, 249
classical Jacobi method, 131
collectively compact operators, 294
compact operator, 288
computer-aided geometric design,
condition number, 80
conjugate gradient method, 285 
contraction operator, 43
convergence order, 108, 238
convex hull, 181
convex set, 98 
cyclic Jacobi method, 132 

de Casteljau algorithm, 183 
defect correction iteration, 69 
defect correction principle, 69 
diagonal matrix, 16
diagonalizable matrix, 133
diagonally dominant, 56
weakly, 59

difference equation, 248 
stable, 248 

direct methods, 5, 119 
discrepancy principle, 85 
distance, 27 
divergent sequence, 27 
divided differences, 154 

eigenvalue, 36 
eigenvector, 36 
elimination methods, 5 
equicontinuous, 289 
equivalent linear system, 12 
equivalent norm, 27 
Euclidean norm, 26 
Euler method, 231 
implicit, 233 
improved, 234 

Euler-Maclaurin expansion, 209 
explicit method, 233 
extrapolation method, 212, 216 

fast Fourier transform, 167 
Fibonacci numbers, 256 
finite difference method, 262 
finite element method, 279 
fixed point, 43 
forward differences, 182 
forward elimination, 13, 14 
Fourier series, 52 
Fourier transform 
fast, 167

Fredholm integral equation, 287 
first kind, 287, 312 
second kind, 287 
frozen Newton method, 109
fully discrete method, 274
function space
C[a,b], 40
composite, 207
global convergence, 95
global error, 238 
maximal, 238 

Gram-Schmidt orthogonalization, 
nomial, 186
Hermitian matrix, 37 
Hessenberg matrix, 144 
Hessian matrix, 114 
Heun method, 234 
Hilbert matrix, 79 
Hilbert space, 40 
Horner scheme, 110 
Householder matrix, 20 
ill-conditioned linear system, 81
ill-posed problem, 77 
implicit method, 233 
initial value problem, 228 
injective operator, 46 
inner product, 29 
interpolation operator, 157, 302 
trigonometric, 169 
interpolation polynomial 
Hermite, 160 
Hermite-Birkhoff, 186 
Lagrange, 153 
Newton, 155 
trigonometric, 163 
interpolatory quadrature, 190 
inverse interpolation, 186 
irreducible matrix, 59 
iterative methods, 5, 119 

Jacobi method, 55 
classical, 131 
cyclic, 132 
damped, 71 
with relaxation, 61 
Jacobian matrix, 99 

kernel, 287 

degenerate, 315 
weakly singular, 291 

Lagrange factor, 153 
Lagrange interpolation polynomial, 
153 

least squares method, 10 
left triangular matrix, 18 
Legendre polynomial, 205 
Levenberg-Marquardt method, 114 
limit, 27 

linear convergence, 108 
linear interpolation, 158 
linear operator, 32 
linear system 

equivalent, 12 
triangular, 12 
Lipschitz condition, 228 



Lipschitz constant, 228 
Lipschitz continuous, 43 
local convergence, 95 
local discretization error, 235, 245 
Lotka-Volterra equations, 255 
lower triangular matrix, 18 
LR decomposition, 18 
$L_1$ norm, 41
$L_2$ norm, 42

Mandelbrot set, 118 
matrix 

adjoint, 6 

consistently ordered, 64 
diagonal, 16 
diagonalizable, 133 
Hermitian, 37 
Hessenberg, 144 
Hessian, 114 
Hilbert, 79 
Householder, 20 
irreducible, 59 
Jacobian, 99 
left triangular, 18 
lower triangular, 18 
normal, 127 
permutation, 19 
positive definite, 19, 37 
positive semidefinite, 37 
reducible, 59 
right triangular, 18 
symmetric, 19 
transposed, 6 
tridiagonal, 7 
unitary, 20 
upper triangular, 18 
Vandermonde, 186 
matrix norm, 34 
maximum norm, 26, 41 
mean value theorem, 99 
midpoint rule, 206 
Milne-Thomson method, 245 
modified Newton method, 109 
Moore-Penrose inverse, 84 
multigrid methods, 74 
multiplicity

algebraic, 36 
geometric, 36 
multistep method, 243 
stable, 251 

Neumann series, 46, 51 
Neville scheme, 156 
Newton interpolation polynomial, 
155 

Newton method, 102 
frozen, 109 
modified, 109 

Newton-Cotes quadrature, 191, 222 
norm, 26 

equivalent, 27 
Euclidean, 26 
Frobenius, 127 
$L_1$, 41
$L_2$, 42

maximum, 26, 41 
stronger, 50 
vector, 26 

normal equations, 49 
normal matrix, 127 
normed space, 26 
Nyström method, 245, 296

open ball, 29 
open set, 28 
operator, 32 

bijective, 46 
bounded, 33 
compact, 288 
continuous, 32 
contraction, 43 
injective, 46 
linear, 32 

strictly coercive, 269 
surjective, 46 
operator norm, 33 
ordinary differential equation, 226 
orthogonal, 31 
orthogonal projection, 48 
orthogonal system, 31 



orthonormal system, 31 

Parseval equality, 52 
partial pivoting, 15 
Peano kernel, 221 
permutation matrix, 19 
pivot element, 14 
pivoting 

complete, 15 
partial, 15 

polygon method, 231 
polynomial 

Bernoulli, 207 
Bernstein, 180 
Chebyshev, 204, 223 
Legendre, 205 

positive definite matrix, 19, 37 
positive semidefinite matrix, 37 
power method, 133 
predictor corrector method, 234 
pre-Hilbert space, 29 
projection method, 272, 303 
pseudo-inverse, 84 

QR algorithm, 133 
deflation, 144 
shift, 144 

QR decomposition, 19 
quadratic convergence, 108 
quadrature 

Chebyshev, 223 
convergent, 198 
Gauss-Chebyshev, 205 
Gauss-Legendre, 205 
Gauss-Lobatto, 223 
Gauss-Radau, 223 
Gaussian, 201 
interpolatory, 190 
Newton-Cotes, 191, 222 
Romberg, 213 
quadrature points, 190 
quadrature weights, 190 

range, 32 

rank one methods, 110 
Rayleigh-Ritz method, 285
rectangular rule, 210 
reducible matrix, 59 
regularization parameter, 86 
relaxation methods, 60 
relaxation parameter, 61 
Riesz theory, 289 
right triangular matrix, 18 
Romberg quadrature, 213 
root condition, 249 
Runge-Kutta method, 241 

Sassenfeld criterion, 57 
scalar product, 29 
scaling, 16 

Schur’s inequality, 127 
secant method, 110 
semidiscrete method, 274 
series, 50 

sesquilinear function, 270 
bounded, 270 
strictly coercive, 270 
shooting method, 258 
multiple, 261 
Simpson’s rule, 192 
composite, 196 

simultaneous displacements, 55 

single-step method, 234 

singular system, 82 

singular value decomposition, 82 

singular values, 81 

Sobolev space, 275 

span, 31 

spectral cutoff, 85 
spectral radius, 38 
spline, 169 

cubic, 170, 175 
spline interpolation, 169 
steepest descent, 115 
Steffensen’s method, 117 
strictly coercive operator, 269 
stronger norm, 50 
Sturm-Liouville problem, 274 
successive approximations, 44 
successive displacements, 57 



successive overrelaxation method, 
62 

superlinear convergence, 110 
surjective operator, 46 
symmetric matrix, 19 

theorem 

Arzelà-Ascoli, 289
Courant, 123 
Faber, 160 
Gerschgorin, 126 
Kahan, 62 
Lax-Milgram, 269 
Marcinkiewicz, 159 
Ostrowski, 63 
Picard-Lindelöf, 228
Rayleigh, 122 
Riesz, 268 
Steklow, 199 
Szegő, 198
Young, 64 

Tikhonov regularization, 86 
transposed matrix, 6 
trapezoidal rule, 192 
composite, 196 
triangle inequality, 26 
second, 26 

triangular linear system, 12 
tridiagonal matrix, 7 
trigonometric interpolation poly- 
nomial, 163 

trigonometric polynomial, 162 
two-grid methods, 68 

uniform boundedness principle, 292 

unitary matrix, 20 

upper triangular matrix, 18 

Vandermonde matrix, 186 
vector norm, 26 
Verhulst equation, 227 
Volterra integral equation, 228, 314 

weak derivative, 275 
well-conditioned linear system, 81 
well-posed problem, 77 